Objectives

The  major  aims  of  the  proposed  project  were:  a)  development  of  quantitative  methods  for 
the  characterization  of  structure  spaces,  b)  application  of  these  newly  developed  methods 
in  selecting  analogs,  c)  development  of  estimation  methods  for  predicting  toxicologically 
relevant  properties  of  chemicals  from  their  analogs,  and  d)  development  of  neural  network 
methods  for  property  estimation  and  analog  selection. 


Status  of  Effort 

During the course of the project, most of our work focused on the first three tasks of the
project, viz., a) characterization of molecular similarity spaces, b) selection of analogs,
and c) similarity-based estimation of properties. Work on these tasks has continued; however, in
approximately the last eighteen months the focus shifted to the fourth and final task of the
project, the application of neural networks in property estimation.

In the area of Task 1, the effectiveness of theoretical molecular descriptors vis-a-vis
experimental physicochemical properties in quantifying intermolecular similarity has
been  explored  for  several  sets  of  compounds  with  varying  physicochemical  and 
biological  properties.  In  Task  2,  the  various  structure  spaces  developed  in  Task  1  have 
been  used  in  the  selection  of  analogs  for  specific  probe  compounds.  In  Task  3,  we  have 
used  the  k-nearest  neighbor  (KNN)  method  to  estimate  properties  of  chemicals  from 
various databases. For these experiments, k has been varied from 1 to 40. The results
showed  that,  for  different  physicochemical,  toxicological  and  biochemical  properties, 
optimal  property  estimation  is  generally  obtained  in  the  range  of  k  =  5-10.  Finally,  in 
Task  4,  we  have  used  neural  networks  for  the  prediction  of  toxicological  endpoints.  In 
addition,  we  examined  several  methods  for  feature  (independent  variable)  selection 
using  a  machine  learning  technique  known  as  genetic  ensemble  feature  selection 
(GEFS), which is based on genetic algorithms. The results show that neural networks, in
general, give some improvement in modeling power over statistical methods, and that
the use of GEFS to select relevant features for modeling greatly improves the
performance of the neural networks.


Accomplishments/New  Findings 

Described  below  are  the  accomplishments  of  the  four  project  tasks  that  have  been 
pursued  during  this  reporting  period. 

TASK 1: Characterization of molecular similarity spaces

Molecular  similarity  spaces  were  constructed  using  computed  molecular  descriptors. 
These  descriptors  included  atom  pairs,  topological  indices,  geometrical  indices,  semi- 
empirical  quantum  chemical  parameters,  and  physicochemical  property  data  and  ab 
initio  quantum  chemical  parameters  when  available.  Atom  pairs  and  topological  indices 
were  calculated  using  in-house  software  packages,  APProbe  and  POLLY 2.3 
respectively;  geometrical  parameters  were  calculated  by  Sybyl  6.4  using  an  SPL  (Sybyl 
Programming Language) program developed in-house; and the quantum chemical
indices  were  calculated  by  MOPAC  6.00.  Additional  physicochemical  property  data 
were  taken  from  the  literature  and  ab  initio  calculations  were  conducted  using  Gaussian 
98W. 

As  part  of  this  task  we  have  also  begun  the  development  of  an  expanded  set  of 
molecular  descriptors.  For  adequate  characterization  of  molecular  similarity  spaces,  we 
must  be  sure  that  we  have  parameters  that  adequately  represent  the  pertinent 
molecular  features.  Currently,  we  have  added  nearly  one-hundred  additional  novel 
indices  to  our  predictor  set  (Pub.  #1)  and  we  plan  to  continue  to  expand  this  set  in  the 
future. 

Two  statistical  methods  were  used  to  derive  non-redundant  information  from  the 
calculated  parameters,  principal  components  analysis  (PCA)  and  variable  clustering 
(VC). The results of these studies have been reported in two peer-reviewed manuscripts
(Pub. #2-4; see Publications below) and in chapters in two books: a volume of the
Discrete  Mathematics  and  Theoretical  Computer  Science  series  (Pub.  #  5)  and  volume 
2  of  the  Advances  in  Molecular  Similarity  series  (Pub.  #6). 
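
To illustrate the idea of deriving non-redundant information from correlated descriptors, the following is a minimal Python sketch. It is not the PCA or variable clustering procedure used in these studies; it is a simplified stand-in that greedily groups descriptors whose absolute Pearson correlation with a group's first member exceeds a threshold and keeps one representative per group. The function names, the threshold of 0.9, and the toy descriptor values are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant descriptor carries no correlation information
    return sxy / sqrt(sxx * syy)

def group_descriptors(columns, threshold=0.9):
    """Greedily group descriptors; columns maps name -> list of values.

    A descriptor joins the first group whose seed it correlates with
    (|r| >= threshold); otherwise it seeds a new group.
    """
    groups = []  # each group is a list of names; group[0] is the seed
    for name, values in columns.items():
        for group in groups:
            if abs(pearson(values, columns[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical descriptor values for five compounds (illustration only).
descriptors = {
    "index_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "index_b": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly proportional to index_a
    "index_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to the others
}
groups = group_descriptors(descriptors)
representatives = [group[0] for group in groups]
```

In this toy case the two nearly collinear indices fall into one group, so only one of them would be carried forward into a similarity space.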

These  similarity  spaces,  constructed  from  theoretical  descriptors  and 
physicochemical  property  data,  are  distinct  in  the  sense  that  they  select  different  sets  of 
analogs  for  a  given  probe  chemical.  The  similarity  spaces  constructed  in  our  studies 
were  used  in  the  selection  of  analogs  and  estimation  of  toxicologically-relevant 
properties  for  diverse  sets  of  chemicals  (See  Task  2  and  3  below). 

Recently  we  have  created  several  similarity  spaces  for  the  identified  constituents 
of  JP-8.  Three  similarity  spaces  were  constructed  using  a  variety  of  descriptors: 
topological  indices,  atom  pairs,  and  physicochemical  descriptors.  This  information  was 
recently  reported  at  the  Air  Force  Office  of  Scientific  Research's  "JP-8  Jet  Fuel 
Toxicology  Workshop"  that  was  held  at  University  of  Arizona,  Tucson,  AZ,  Jan  11-12, 
2000.  These  studies  were  conducted  as  part  of  a  cluster-analysis,  rather  than  to  find 
analogous  chemicals  or  for  the  estimation  of  properties  (See  Task  2  for  further 
discussion). 

The  optimal  characterization  of  molecular  structure  is  prerequisite  to  the  creation 
of  useful  similarity  spaces  and  the  prediction  of  the  toxicity  of  chemicals  for  which  very 
little  experimental  data  is  available.  A  novel,  hierarchical  approach  was  used  in 
selecting  orthogonal  structural  information  from  calculated  topostructural,  topochemical, 
geometrical,  and  semi-empirical  quantum  chemical  descriptors.  The  resultant 
orthogonal  structural  information  was  used  to  develop  hierarchical  quantitative 
structure-activity  relationship  (QSAR)  models  for  predicting  properties  such  as  inhibition 
of  the  complement  system  by  benzamidines  and  the  dermal  penetration  of  polycyclic 
aromatic  hydrocarbons.  Results  of  this  research  have  been  reported  in  six  recent 
publications (Pub. #7-12) and reviewed in a chapter (Pub. #13) of the book Topological
Indices and Related Descriptors in QSAR and QSPR. This hierarchical approach has
also  been  employed  in  the  development  of  similarity  spaces  (Pub.  #3  &  6). 

TASK  2:  Selection  of  analogs 

The  similarity  spaces  created  using  the  atom  pair  (AP)  and  PCA  methods  were  used  in 
the  selection  of  analogs  for  probe  compounds.  In  one  study  (Pub.  #2),  five  distinct 
similarity  spaces  were  created  from:  a)  calculated  topostructural  indices  (TSI)  only,  b) 
calculated topochemical indices (TCI) only, c) a combination of both TSI and TCI, d)
calculated  atom  pairs,  and  e)  physicochemical  property  data  taken  from  the  literature. 

In  another  study  (Pub.  #6),  three  distinct  similarity  spaces  were  created  from:  a) 
calculated  TSI  only,  b)  calculated  TCI  only,  and  c)  a  combination  of  both  TSI  and  TCI. 

The former of these studies (Pub. #2) attempted to quantify the degree of overlap
between  similarity  measures.  To  this  end,  the  analogs  selected  for  a  set  of  76 
compounds  were  compared  and  the  methods  were  scored  in  a  pair-wise  fashion  to 
demonstrate  the  degree  of  overlap.  This  study  resulted  in  the  discovery  that,  for  this 
particular  set  of  compounds,  even  though  the  degree  of  overlap  between  the  groups  of 
analogs  selected  by  theoretical  descriptor  spaces  is  relatively  high,  the  similarity  space 
constructed  from  physicochemical  property  data  provided  relatively  unique  groups  of 
analogs. 
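
The pair-wise comparison of analog sets can be sketched as follows. This is a minimal Python illustration, not the scoring scheme of Pub. #2: it assumes Euclidean distance within each space, a fixed number k of analogs per probe, and an overlap score defined as the mean fraction of shared analogs. The function names and the toy coordinates are hypothetical.

```python
from math import dist

def nearest_analogs(space, probe, k):
    """Return the set of the k compounds closest to probe in one space.

    space maps compound name -> coordinate tuple; the probe is excluded.
    """
    others = [name for name in space if name != probe]
    others.sort(key=lambda name: dist(space[name], space[probe]))
    return set(others[:k])

def mean_overlap(space_a, space_b, k):
    """Average fraction of analogs shared by two spaces over all probes."""
    probes = sorted(space_a)
    shared = [
        len(nearest_analogs(space_a, p, k) & nearest_analogs(space_b, p, k)) / k
        for p in probes
    ]
    return sum(shared) / len(shared)

# Hypothetical coordinates for four compounds in two similarity spaces.
theoretical = {"c1": (0.0, 0.0), "c2": (1.0, 0.0), "c3": (5.0, 0.0), "c4": (6.0, 0.0)}
property_space = {"c1": (0.0, 0.0), "c2": (5.0, 0.0), "c3": (1.0, 0.0), "c4": (6.0, 0.0)}

overlap_same = mean_overlap(theoretical, theoretical, k=1)
overlap_diff = mean_overlap(theoretical, property_space, k=1)
```

A score near 1 indicates that two spaces select largely the same analogs, while a score near 0 indicates that they provide distinct sets.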

This  demonstrates  that  if  one  is  attempting  to  determine  the  optimal 
characterization  for  a  similarity  space  it  is  best  to  employ  two  or  three  distinct  methods, 
e.g.,  one  theoretical  space  and  one  property  space,  rather  than  two  theoretical  spaces 
that  may  have  a  high  degree  of  overlap. 

Further  investigation  is  needed  to  determine  which  of  these  similarity  techniques 
is  most  capable  of  estimating  toxicological  properties  of  chemicals  from  the  toxicity  data 
of  their  selected  neighbors.  It  should  also  be  noted  that  while  similarity  spaces  derived 
from  physicochemical  property  data  seem  to  be  unique  as  compared  to  theoretically- 
derived  similarity  spaces,  relevant  physicochemical  data  is  not  always  readily  available 
for  all  the  compounds  in  a  given  set.  In  technology  transfer,  this  finding  will  have 
important  implications.  Many  drug  companies  are  using  molecular  similarity  methods  in 
their drug discovery process. One aim of this research is to derive molecular
similarity methods that are non-redundant, and further pursuit of this issue could be
beneficial to these companies and others involved in the design, synthesis, and testing
of new chemicals.

In  addition  to  the  selection  of  analogs,  similarity  spaces  can  be  used  in  cluster- 
analysis.  This  technique  assesses  the  molecular  similarity  and  examines  the  distances 
between  molecules  within  the  similarity  space  to  form  clusters  of  related  compounds. 
The  clusters  are  formed  around  a  central  point  (centroid)  and  have  a  set  radius  based 
on  the  molecular  density  around  the  centroid.  The  distance  from  the  cluster  centroid  to 
any  compound  within  that  cluster  can  be  measured,  telling  us  which  compounds  are 
nearest  the  centroid  and  which  compounds  are  furthest  from  the  centroid.  This  type  of 
study  is  useful  in  scanning  large  real  or  virtual  chemical  libraries  in  looking  for  new 
pharmaceutical  leads  or  for  other  testing  problems  in  which  the  number  of  compounds 
is  simply  too  large,  and  therefore  too  expensive,  to  subject  the  entire  set  to  proper 
toxicological  screening.  In  this  situation,  representatives  from  each  of  the  clusters  can 
be  tested  on  the  assumption  that  since  the  compounds  within  each  cluster  are  similar, 
their  properties  should  also  be  similar. 
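
The selection of cluster representatives described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: cluster membership is taken as already known (the clustering algorithm itself is not shown), the centroid is the coordinate-wise mean, distances are Euclidean, and all names and coordinates are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def rank_by_centroid_distance(space, members):
    """Return (name, distance) pairs for one cluster, nearest first."""
    center = centroid([space[name] for name in members])
    ranked = [(name, dist(space[name], center)) for name in members]
    ranked.sort(key=lambda pair: pair[1])
    return ranked

def pick_representatives(space, clusters):
    """Choose the compound nearest the centroid of each cluster."""
    return {
        label: rank_by_centroid_distance(space, members)[0][0]
        for label, members in clusters.items()
    }

# Hypothetical similarity-space coordinates and cluster assignments.
space = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0),
    "d": (10.0, 10.0), "e": (10.0, 12.0), "f": (10.0, 17.0),
}
clusters = {"cluster_1": ["a", "b", "c"], "cluster_2": ["d", "e", "f"]}
representatives = pick_representatives(space, clusters)
```

Only the representatives would then be submitted for testing, with the ranked distances indicating how typical each remaining compound is of its cluster.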

Just such a study has been carried out on 194 of the isolated compounds in JP-8.
The three similarity spaces (See Task 1 above) were clustered in an attempt to
determine  the  optimal  number  of  topological,  atom  pair  or  physicochemical  clusters  to 
be  used  on  a  set  of  nearly  200  chemicals  and  the  optimal  representation  for  this 
particular  set  of  compounds. 

Further research coupled with analytical testing is necessary to truly determine
the  optimal  method  for  the  creation  of  molecular  similarity  spaces  for  use  with  cluster- 
analysis.  However,  this  technique  promises  to  be  very  useful  in  simplifying  the  problems 
of analyzing mixture toxicity, as in the case of JP-8, and in the analysis and
pre-screening of large virtual chemical libraries in search of novel drug leads.

TASK  3:  Similarity-based  estimation  of  properties 

Similarity  spaces  created  in  Task  1  have  been  used  in  the  estimation  of  properties 
using the k-nearest neighbor (KNN) method. One such study used the KNN approach
for the classification of a set of 113 compounds as mutagens and non-mutagens (Pub.
#5).  Both  the  AP  and  PCA  methods  were  employed  to  predict  mutagenic  activity  with 
comparable  results  for  both  methods. 
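
As a rough illustration of similarity-based estimation, the Python sketch below estimates a property as the mean over the k nearest neighbors in a similarity space and compares several values of k by leave-one-out error. It is a minimal sketch, not the procedure of the cited studies: Euclidean distance, the mean as the estimator, mean absolute error as the criterion, and the toy data are all assumptions.

```python
from math import dist

def knn_estimate(space, values, probe, k):
    """Estimate the probe's property as the mean over its k nearest neighbors."""
    others = [name for name in space if name != probe]
    others.sort(key=lambda name: dist(space[name], space[probe]))
    neighbors = others[:k]
    return sum(values[name] for name in neighbors) / len(neighbors)

def leave_one_out_error(space, values, k):
    """Mean absolute error when each compound is estimated from the rest."""
    errors = [abs(knn_estimate(space, values, p, k) - values[p]) for p in space]
    return sum(errors) / len(errors)

def best_k(space, values, k_values):
    """Return the k with the lowest leave-one-out error (smallest k on ties)."""
    return min(k_values, key=lambda k: (leave_one_out_error(space, values, k), k))

# Hypothetical one-dimensional similarity space with a linear property.
space = {f"m{i}": (float(i),) for i in range(6)}
values = {name: 2.0 * coords[0] for name, coords in space.items()}

estimate_m2 = knn_estimate(space, values, "m2", k=2)
chosen_k = best_k(space, values, [1, 2, 3])
```

For a classification endpoint such as mutagenicity, the mean would be replaced by a majority vote among the neighbors; the search over k is otherwise the same.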

This  research  will  have  important  implications  both  in  computational  toxicology 
and  pharmaceutical  drug  discovery.  In  toxicology,  most  of  the  chemicals  in  commerce 
and  new  chemical  entities  do  not  have  the  data  necessary  for  proper  risk  assessment. 
Similarity  methods  can  be  used  in  the  quick  estimation  of  properties  in  such  cases. 
Combinatorial  chemistry,  which  produces  thousands  of  chemicals  per  week,  is  fast 
growing  as  the  method  for  drug  discovery  and  lead  optimization.  Only  certain  bioassays 
that  can  be  run  in  a  96-well  plate  at  micromolar  concentrations  are  carried  out  for  these 
new  compounds.  Few  if  any  of  these  chemicals  have  simple  property  data  such  as 
boiling  point  or  vapor  pressure,  let  alone  the  more  complex  pharmacokinetic  or 
pharmacodynamic data. However, all of these chemicals have a known molecular
structure. Our molecular similarity methods, based on the AP or PCA methods, or
utilizing the hierarchical approach and the newly developed hierarchical QSAR
approach,  can  be  enormously  beneficial  in  such  situations  for  the  rapid  and  reasonable 
estimation  of  necessary  properties. 

TASK  4:  Application  of  neural  networks  in  property  estimation 

Neural  networks  have  been  constructed  for  the  estimation  of  acute  aquatic  toxicity 
(LC50) in fathead minnow (Pimephales promelas) (Pubs. #14-16). In the first two studies,
two standard backpropagation neural networks were constructed for the estimation of
toxicity:  a)  a  network  using  95  topological,  geometrical,  and  quantum  chemical 
parameters,  and  b)  a  network  using  a  subset  of  23  of  the  95  parameters  based  on  a 
statistical  method  for  variable  clustering  (VC)  (Pub.  #14  &  15).  The  performance  of 
these  models  was  on  par  with  the  performance  of  linear  statistical  methods  from  an 
earlier  study.  However,  the  neural  network  using  only  23  parameters  showed  a  slight 
improvement  in  model  performance  over  the  model  using  all  95  parameters. 

The  third  study  (Pub.  #16)  focused  on  the  use  of  a  machine  learning  technique, 
rather  than  traditional  statistical  approaches,  for  the  selection  of  a  reduced  set  of  model 
parameters.  Seeing  the  improvement  made  by  using  a  reduced  feature  set  (set  of 
molecular  descriptors)  in  our  first  two  studies,  we  decided  to  try  other  techniques  for 
limiting the feature set. This study compared the estimation of aquatic toxicity between
three  models:  a)  a  neural  network  using  all  95  parameters,  b)  a  statistical  analysis  using 
23  parameters  selected  through  the  variable  clustering  procedure,  and  c)  a  neural 
network  utilizing  a  genetic  ensemble  feature  selection  (GEFS)  algorithm.  The  neural 
network  using  the  GEFS  algorithm  developed  by  David  Opitz  showed  significant 
improvement over both the linear statistical model and the "standard" neural network
model. 
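
The general idea of searching for a reduced feature set with a genetic algorithm can be sketched as follows. This is not the GEFS algorithm, which builds an ensemble of neural networks; it is a much simpler single-model illustration in Python that evolves boolean feature masks and scores each mask by the leave-one-out error of a 1-nearest-neighbor predictor, used here in place of a neural network to keep the example short. The population size, mutation scheme, fitness function, and toy data are all assumptions.

```python
import random
from math import dist

def loo_error(rows, targets, mask):
    """Leave-one-out 1-nearest-neighbor error using only masked-in features."""
    cols = [i for i, keep in enumerate(mask) if keep]
    if not cols:
        return float("inf")  # an empty feature set cannot make predictions
    pts = [tuple(row[i] for i in cols) for row in rows]
    total = 0.0
    for i, p in enumerate(pts):
        nearest = min((j for j in range(len(pts)) if j != i),
                      key=lambda j: dist(p, pts[j]))
        total += abs(targets[nearest] - targets[i])
    return total / len(pts)

def select_features(rows, targets, generations=20, pop_size=12, seed=0):
    """Search feature masks with a simple genetic algorithm (lower error wins)."""
    rng = random.Random(seed)
    n = len(rows[0])

    def score(mask):
        return loo_error(rows, targets, mask)

    # Start from the full feature set plus random masks.
    population = [tuple([True] * n)]
    population += [tuple(rng.random() < 0.5 for _ in range(n))
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = list(a[:cut] + b[cut:])
            flip = rng.randrange(n)             # single-bit mutation
            child[flip] = not child[flip]
            children.append(tuple(child))
        population = parents + children
    return min(population, key=score)

# Hypothetical data: feature 0 tracks the target, features 1 and 2 are noise.
rows = [(0.0, 9.0, 1.0), (1.0, 2.0, 8.0), (2.0, 7.0, 3.0),
        (3.0, 1.0, 9.0), (4.0, 8.0, 2.0), (5.0, 3.0, 7.0)]
targets = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
best_mask = select_features(rows, targets)
```

Because the full feature set is seeded into the initial population and the best half always survives, the returned mask can never score worse than using all features.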

descriptors: a hierarchical QSAR approach, B.D. Gute and S.C. Basak, SAR
QSAR  Environ.  Res.,  7,  117-131,  1997. 

12.  Use  of  topostructural,  topochemical  and  geometric  parameters  in  the  prediction 
of  vapor  pressure:  a  hierarchical  QSAR  approach,  S.C.  Basak,  B.D.  Gute  and 
G.D. Grunwald, J. Chem. Inf. Comput. Sci., 37, 651-655, 1997.

13. A hierarchical approach to the development of QSAR models using topological,
geometrical and quantum chemical parameters, S.C. Basak, B.D. Gute and G.D.
Grunwald, in: Topological Indices and Related Descriptors in QSAR and
QSPR, eds. J. Devillers and A.T. Balaban, Gordon and Breach: Reading, UK,
in press, 1999.

14. Use of statistical and neural net methods in predicting toxicity of chemicals: A
hierarchical QSAR approach, S.C. Basak, B.D. Gute, G.D. Grunwald, D.W. Opitz
and K. Balasubramanian, in: Predictive Toxicology of Chemicals: Experiences
and Impact of AI Tools - Papers from the 1999 AAAI Symposium, AAAI Press,
Menlo Park, CA, 1999, p. 108-111.

15. Use of statistical and neural net approaches in predicting toxicity of chemicals,
S.C. Basak, G.D. Grunwald, B.D. Gute, K. Balasubramanian, and D. Opitz, J.
Chem. Inf. Comput. Sci., submitted, 1999.

16. Hazard assessment modeling: An evolutionary ensemble approach, D.W. Opitz,
S.C. Basak, and B.D. Gute, in: GECCO-99: Proceedings of the Genetic and
Evolutionary  Computation  Conference,  eds.  W.  Banzhaf,  J.  Daida,  A.E.  Eiben, 
M.H.  Garzon,  V.  Honavar,  M.  Jakiela,  &  R.E.  Smith,  Morgan  Kaufmann:  San 
Francisco,  accepted,  1999. 

Copies  of  all  manuscripts  have  been  attached  as  the  Appendices. 


Interactions/Transitions 

Participation/Presentations 

1. Subhash Basak gave an invited presentation "Exploring the scientific basis of
Ayurvedic Medicine: A computational approach" at the conference "Beyond
Conventional  Healthcare:  Understanding  Alternative  Choices"  organized  by  the 
University  of  Wisconsin,  Superior,  November  12-13,  1999. 

2. Subhash Basak gave an invited presentation on "Development of hierarchical
QSAR  models  for  predicting  toxicity  of  chemicals:  statistical  and  neural  net 
approaches"  at  the  Air  Force  Predictive  Toxicology  Conference,  Wright  Patterson 
Air  Force  Base,  Dayton,  OH,  October  7,  1999. 

3.  Subhash  Basak  gave  the  following  invited  research  presentations/  invited 
seminars  during  his  trip  to  Europe  and  India: 

a) "A hierarchical QSAR approach for predicting property/activity of chemicals from
structure"  at  the  Rugjer  Boskovic  Institute,  Zagreb,  The  Republic  of  Croatia, 
August  26,  1999 

b) "Predicting property/activity/toxicity of chemicals from structure: A hierarchical
QSAR approach" at the National Institute of Chemistry, Slovenia, August 30,
1999

c) "Prediction of activity/toxicity of chemicals from structure using graph invariants"
at Visva Bharati University, Santiniketan, West Bengal, India, September 9, 1999

d)  "Clustering  of  Psoralen  Derivatives  using  Topological  Invariants:  a  strategy  for 
molecular design" presented at the 13th International Biophysics Congress, New
Delhi, September 19-24, 1999, authored jointly by Subhash C. Basak, Gregory
D. Grunwald, Alexandru T. Balaban (Polytechnic University, Romania) and
Kanika  Basak  (St.  Xavier’s  Computer  Center,  Calcutta,  India) 

e)  "A  Hierarchical  QSAR  Approach  to  Predicting  Bioactivity  of  Chemicals  using 
Theoretical  Molecular  Descriptors"  presented  at  the  13th  International  Biophysics 
Congress,  New  Delhi,  September  19-24,  1999,  authored  jointly  by  Subhash  C. 
Basak, Brian D. Gute, Denise Mills, Gregory D. Grunwald, David Opitz
(University of Montana, Missoula), and Krishnan Balasubramanian (Dept. of
Chemistry  and  Biochemistry,  Arizona  State  University,  Tempe,  AZ) 

f) "Modeling the Solubility of Aliphatic Alcohols in Water: Graph Connectivity
Indices  versus  Line  Graph  Connectivity  Indices"  presented  at  the  13th 
International  Biophysics  Congress,  New  Delhi,  September  19-24,  1999,  authored 
jointly  by  Dragan  Amic  (The  Rugjer  Boskovic  Institute,  Croatia),  Subhash  C. 
Basak,  Drago  Beslo  (Croatia),  Sonja  Nikolic  (The  Rugjer  Boskovic  Institute, 
Croatia)  and  Nenad  Trinajstic  (The  Rugjer  Boskovic  Institute,  Croatia) 

g)  "Design  of  High  Quality  Structure-Property  Regressions"  presented  at  the  13th 
International  Biophysics  Congress,  New  Delhi,  September  19-24,  1999,  authored 
jointly by Milan Randic (Drake University, IA) and Subhash C. Basak

h) "On Numerical Characterization of DNA Primary Sequences," presented at the
13th International Biophysics Congress, New Delhi, September 19-24, 1999,
authored jointly by Milan Randic (Drake University), Marjan Vracko (National
Institute of Chemistry, Slovenia), Ashesh Nandy (Indian Institute of Chemical
Biology, Calcutta, India) and Subhash C. Basak

i)  "Predicting  biomedicinal  and  toxicological  properties  of  chemicals  using 
molecular  descriptors"  at  the  University  of  Delhi,  India,  September  24,  1999 

j)  "The  utility  of  Ayurvedic  medicine  for  modern  drug  discovery:  An  exploratory 
analysis"  at  the  conference  organized  by  the  East  India  Pharmaceutical 
Company,  Calcutta,  September  29,  1999 

4.  Subhash  Basak  presented  the  following  papers  at  the  QSAR  Gordon 
Conference,  July  25-30,  1999,  Tilton,  New  Hampshire: 

a)  A  hierarchical  QSAR  approach  for  predicting  property/activity  of  chemicals, 
authored  by  Basak,  Greg  Grunwald,  Brian  Gute,  Denise  Mills,  Krishnan 
Balasubramanian  (Department  of  Chemistry  and  Biochemistry,  Arizona  State 
University,  Tempe,  Arizona),  and  Alexandru  Balaban  (Polytechnic  University, 
Bucharest,  Romania) 

b)  Topological  indices  as  molecular  descriptors  for  QSAR,  authored  by  Balaban 
and  Basak 

5.  Subhash  Basak  and  Milan  Randic,  a  Distinguished  Professor  of  Mathematics 
and  Computer  Science  at  Drake  University,  Iowa,  and  a  Visiting  Scientist  at 
NRRI, jointly organized a one-day Workshop on Applied Mathematical
Chemistry: Molecular Descriptors and Their Applications in Structure-Property-
Activity-Toxicity Relationship, May 3, 1999, at NRRI. Thirteen speakers from
seven  different  countries,  viz.,  Bulgaria,  Croatia,  India,  Romania,  Slovenia, 
United Kingdom and United States, gave invited presentations on their latest
research  on  Mathematical  Chemistry,  Quantitative  Structure  Activity 
Relationships  (QSAR),  Computational  Chemistry  and  Predictive  Toxicology.  Dr. 
Michael  J.  Lalich,  Director  of  NRRI,  welcomed  the  guests  and  Dr.  Vincent 
Magnuson,  Vice  Chancellor  for  Academic  Administration  at  UMD  inaugurated  the 
workshop. 

6. Brian Gute attended the Annual American Chemical Society meeting, March 21-25,
1999,  Anaheim,  CA. 

7.  Subhash  Basak  gave  the  following  invited  presentations  on  QSAR/  predictive 
toxicology: 

a)  “A  computational  approach  to  predicting  toxicity  and  toxic  modes  of  action  of 
chemicals  from  structure”  at  the  International  Conference  ‘Smarter  Lead 
Optimization:  easing  the  bottleneck’  organized  by  Cambridge  Health  Institute, 
March  18-19,  1999,  San  Diego,  CA 

b)  “Topological  indices  as  molecular  descriptors  for  lead  optimization”  authored 
jointly  by  Alexandru  T.  Balaban  and  Subhash  C.  Basak,  at  the  International 
Conference  ‘Smarter  Lead  Optimization:  easing  the  bottleneck’  organized  by 
Cambridge  Health  Institute,  March  18-19,  1999,  San  Diego,  CA 

c)  “Use  of  statistical  and  neural  net  methods  in  predicting  toxicity  of  chemicals:  a 
hierarchical  QSAR  approach”  authored  jointly  by  Subhash  C.  Basak,  Gregory  D. 
Grunwald,  Brian  D.  Gute,  K.  Balasubramanian  (Department  of  Chemistry  and 
Biochemistry, Arizona State University, Tempe, AZ), and David Opitz
(Department of Computer Science, University of Montana, Missoula, Montana) at
the American Association of Artificial Intelligence (AAAI) conference, "Predictive
Toxicology of Chemicals: Experiences and Impact of AI Tools," Stanford
University  (CA),  March  22-24,  1999 

d) “A Graphical Technique for Preliminary Assessment of Effects on DNA
Sequences  from  Toxic  Substances”  authored  jointly  by  A.  Nandy  (Indian  Institute 
of  Chemical  Biology,  Calcutta,  India),  C.  Raychaudhury  (IICB,  Calcutta,  India) 
and  Subhash  Basak  at  the  American  Association  of  Artificial  Intelligence  (AAAI) 
conference, "Predictive Toxicology of Chemicals: Experiences and Impact of AI
Tools,"  Stanford  University  (CA),  March  22-24,  1999 

8. Brian Gute attended the Annual Society of Toxicology meeting, March 13-17, 1999,
New  Orleans,  LA. 

9.  Subhash  Basak  gave  the  following  invited  lectures/  presentations: 

a)  The  first  distinguished  lecture  in  Mathematical  Chemistry  on  “From  Graph 
Invariants to Molecular Design: 25 years after the connectivity index” at Visva
Bharati University, Santiniketan, West Bengal, India, February 11, 1999

b)  An  invited  seminar  on  “Theoretical  molecular  descriptors  for  the  prediction  of 
bioactivity, toxicity, selection of analogs, discovery and optimization of leads” at
the  Wockhardt  Research  Centre,  Aurangabad,  Maharashtra,  India,  on  February 
15,  1999 

c)  An  invited  lecture  on  “Prediction  of  bioactivity  of  chemicals  from  structure:  a 
hierarchical  computational  approach”  at  Bharatiya  Vidya  Bhavan’s  Swami 
Prakashananda Ayurvedic Research Center, Mumbai, India, on February 18,
1999

d)  An  invited  lecture  on  “Toxicology  in  silico:  addressing  the  quagmire  of 
environmental  pollution  and  protecting  public  health  using  computational 
chemistry,” authored jointly by Subhash C. Basak, Brian D. Gute, David Opitz
(Computer Science Department, University of Montana, Missoula) and Gregory
D. Grunwald at the International Symposia Series: Reducing the Environmental
Impacts of Toxic Chemicals in Asian Economies. The Impacts of Toxic
Chemicals and Pollutants on Public Health, the Ecology and the Environment of
the Bengal Basin - Bangladesh and India, Dhaka, Bangladesh, on March 1, 1999

e)  An  invited  seminar  on  “Novel  drug  discovery  methods:  predicting 
pharmacological  and  toxicological  properties  of  chemicals  using  computational 
chemistry”  at  the  School  of  Pharmacy,  Dhaka  University,  Dhaka,  Bangladesh  on 
March  4,  1999 

f)  An  invited  talk  on  “Computational  toxicology:  a  cost  effective  approach  for  the 
protection  of  human  and  environmental  health”  at  the  International  Conference  at 
Santiniketan,  India,  March  7, 1999 

g)  An  invited  presentation  “Estimation  of  DNA  Damage  from  Toxic  Chemicals  by 
Graphical  Techniques”  authored  jointly  by  Ashesh  Nandy  (Head  of  the  Computer 
Division,  Institute  of  Chemical  Biology  (IICB),  Calcutta,  India),  C.  Raychaudhury 
and S. Ghosh, Research Scientists at IICB and Subhash Basak on March 8,
1999


10.  Subhash  Basak  gave  an  invited  lecture  on  "Novel  Drug  Discovery  Methods: 
Predicting  pharmacological  and  toxicological  properties  of  chemicals  using 
computational  chemistry"  at  the  Meharry  Medical  College,  Nashville,  TN,  January 
19,  1999. 

11. Subhash Basak had a site visit to the Molecular Anatomy Laboratory,
Department of Biology, Indiana University Purdue University, of Indiana,
Columbus,  IN,  January  12-16,  1999,  as  part  of  the  US  Air  Force  Predictive 
toxicology  program,  to  discuss  the  use  of  proteomics  in  the  development  of 
QSAR models for JP8 jet fuel with colleagues from University of Minnesota, TC
campus, University of Montana, Missoula and IUPUI.

12.  Subhash  Basak  gave  an  invited  presentation  “Clustering  of  JP-8  constituents  into 
structurally  dissimilar  groups:  a  novel  computational  strategy  for  predictive 
toxicology”,  authored  jointly  by  Basak  and  Greg  Grunwald,  at  the  Air  Force  Office 
of  Scientific  Research  JP-8  Jet  Fuel  Toxicology  Workshop,  held  at  the  University 
of  Arizona,  Tucson,  AZ,  December  2-3, 1998. 

13. Brian Gute presented an invited talk “A hierarchical QSAR approach to predicting
carcinogenicity of chemicals” authored jointly by Subhash Basak, Gute and
Greg Grunwald, at the 19th Annual Society of Environmental Toxicology and
Chemistry meeting, Charlotte, North Carolina, November 15-19, 1998.

14.  Subhash  Basak  presented  the  following  invited  lectures: 

a)  “Theoretical  molecular  descriptors  for  the  prediction  of  bioactivity/toxicity, 
selection  of  analogs,  discovery  and  optimization  of  leads”  authored  jointly  by 
Basak,  Brian  Gute,  Gregory  Grunwald,  and  Alexandru  T.  Balaban  (Professor  of 
Organic Chemistry at the Polytechnic University, Bucharest, Romania) at the
Astra Symposium on “Advance in Medicinal Chemistry” organized by the Astra
company, Bangalore, September 17-19, 1998.

b)  “Prediction  of  bioactivity  of  chemicals  from  structure:  a  computational  approach” 
at the Indian Institute of Science, Bangalore, India, September 20, 1998.

c)  “Integration  of  traditional  Indian  medicine  and  chemoinformatics  for  rapid  drug 
discovery”  at  the  conference  organized  jointly  by  East  India  Pharmaceutical 
Company,  Calcutta,  October  12,  1998. 

15. Subhash Basak attended the Annual American Chemical Society meeting,
August 23-27, 1998, Boston, Massachusetts.

16.  Dr.  S.C.  Basak  presented  the  invited  lecture  “Use  of  theoretical  structural 
descriptors  in  molecular  design  and  hazard  assessment  of  chemicals”  to  the 
scientists of the computer-aided drug design company NANODESIGN, INC.,
Toronto, Canada, July 6, 1998.

17.  Dr.  S.C.  Basak  presented  an  invited  seminar  “Novel  Drug  Design  Methods: 
assessing  activity  and  toxicity  using  computational  chemistry”  at  the  Department 
of  Molecular  Biology  and  Genetics,  University  of  Guelph,  Ontario,  Canada,  July 
3,  1998. 

18.  Dr.  S.C.  Basak  presented  a  paper  "Dissimilarity-based  clustering  of  psoralen 
derivatives  in  the  topological  structure  space:  a  strategy  for  drug  design"  at  the 
Second  Annual  Chemoinformatics  Workshop,  organized  by  the  Cambridge 
Health Institute, Boston, MA, June 15-16, 1998. The paper was co-authored by
G.D. Grunwald and B.D. Gute.

19.  Dr.  S.C.  Basak  presented  the  following  papers  at  the  International  Conference 
"Computational  Methods  in  Toxicology"  held  April  20-22,  1998,  Dayton,  OH: 

a) “Use of computational methods in predicting potential toxicity of chemicals,”
authored  jointly  by  S.C.  Basak,  B.D.  Gute  and  G.D.  Grunwald. 

b)  "On  construction  of  optimal  molecular  descriptors,"  authored  jointly  by  M.  Randic 
and  S.C.  Basak. 

c)  "Predicting  mode  of  action  of  chemicals  from  structure:  a  hierarchical  approach," 
authored  jointly  by  S.C.  Basak,  G.D.  Grunwald  and  B.D.  Gute. 

d)  "A  hierarchical  approach  to  predictive  toxicology  using  computed  molecular 
descriptors,"  authored  jointly  by  B.D.  Gute,  G.D.  Grunwald  and  S.C.  Basak 

20.  Dr.  S.C.  Basak  gave  an  invited  presentation  entitled  “A  computational  approach 
to  predicting  toxicity:  Possible  applications  to  JP8  jet  fuel”  at  the  First 
International  Conference  on  the  Environmental  Health  and  Safety  of  Jet  Fuels, 
organized  jointly  by  US  Air  Force,  National  Institute  of  Occupational  Safety  and 
Health,  USEPA  National  Exposure  Research  Laboratory  and  American  Industrial 
Hygiene Association, April 1-3, 1998, San Antonio, TX.

21. Dr. S.C. Basak chaired a session at the DIMACS Workshop on Discrete
Mathematical Chemistry, March 23-25, 1998, held at Rutgers University, New
Jersey. He also presented an invited paper entitled “Use of graph invariants in
QSAR and predictive toxicology” at the conference, authored jointly by S.C.
Basak, B.D. Gute and G.D. Grunwald.

22.  Dr.  S.C.  Basak  gave  several  invited  lectures  at  various  national  and  international 
symposia: 

a)  A  distinguished  lecture  “Rational  drug  design  and  Ayurvedic  medicine”  at  the 
conference  organized  by  the  Association  of  Ayurvedic  Doctors  of  India  (AADI), 
January  4,  1998. 

b)  An  invited  lecture  on  “Use  of  computational  methods  and  Ayurvedic  knowledge 
in  modern  drug  discovery”  at  the  conference  AYURVEDA  TODAY,  January  8, 
1998. 

c)  An  invited  seminar  on  “Assessment  of  genotoxicity  of  chemicals  from  structure:  a 
computational  approach”  at  the  Annual  Conference  of  the  Indian  Association  for 
Cancer Congress, Calcutta, January 21-24, 1998, co-authored by B.D. Gute and G.D. Grunwald.

23. Dr. S.C. Basak was the Co-Chairperson of the First Indo/US Workshop on
Mathematical Chemistry, organized jointly by NRRI and Visva Bharati University,
Santiniketan, West Bengal, India, Jan 9-13, 1998. Basak presented the following
papers at the workshop:

a)  “Graph  invariants,  molecular  similarity  and  QSAR”  coauthored  by  B.D.  Gute  and 
G.D.  Grunwald. 

b)  “Weighted  paths  as  novel  optimal  molecular  descriptors"  authored  jointly  by  M. 
Randic,  President,  International  Society  for  Mathematical  Chemistry  and  S.C. 
Basak. 

c)  “The  utility  of  hierarchical  model  development  in  examining  the  structural  basis  of 
properties”  authored  by  B.D.  Gute,  G.D.  Grunwald  and  S.C.  Basak. 

d)  “Weighted  K-nearest  neighbors  property  estimation  in  molecular  similarity” 
authored  by  G.D.  Grunwald,  B.D.  Gute  and  S.C.  Basak. 

e)  “Dissimilarity  based  clustering  of  psoralen  derivatives  in  the  topological  structure 
space:  a  strategy  for  drug  design”  authored  by  S.C.  Basak,  G.D.  Grunwald,  D. 
Panja,  K.  Basak  and  B.D.  Gute. 

24.  Subhash  C.  Basak  presented  an  invited  lecture  entitled  “Predicting  bioactivity  of 
chemicals  from  structure;  a  hierarchical  QSAR  approach”  to  the  Department  of 
Biochemistry,  University  of  Calcutta,  Calcutta,  India,  July  30,  1997. 

25.  Subhash  C.  Basak  presented  an  invited  lecture  entitled  “Prediction  of 
physicochemical  and  toxicological  properties  of  chemicals  using  theoretical 
molecular descriptors”, at Moscow State University, Moscow, Russia, June
30, 1997.

26.  Subhash  C.  Basak,  Brian  D.  Gute,  and  Greg  D.  Grunwald  presented  an  invited 
paper  entitled  "Use  of  theoretical  molecular  descriptors  in  structure-property  and 
structure-activity studies" at the 7th International Conference on Mathematical
Chemistry and 3rd Girona Seminar on Molecular Similarity, Girona, Spain, May
26-31, 1997.

27.  Subhash  C.  Basak,  Brian  D.  Gute  and  Greg  D.  Grunwald  presented  an  invited 
paper  entitled  "Use  of  nonempirical  structural  descriptors  in  QSAR"  in  the 
session  "Mathematical  approaches  to  QSAR  and  predictive  toxicology"  of  the 
11th International Conference on Mathematical and Computer Modelling and
Scientific Computing in Washington, DC, March 27-April 3, 1997.

28.  Subhash  C.  Basak  presented  a  seminar  “Computational  chemical  graph  theory 
and its practical applications" in the Scientific Computing Seminar, Laboratory for
Intelligent Systems - ECE Dept. and CSc Dept., University of Minnesota, Duluth,
on January 29, 1997.

29.  Subhash  C.  Basak  gave  a  presentation  “Development  of  QMSA  and  QSAR 
methods  for  hazard  assessment  of  chemicals:  tools  for  computational  toxicology” 
at  the  Air  Force  Office  of  Scientific  Research  (AFOSR)  Toxicology  Program 
Review,  December  12-13,  1996,  Fairborn,  Ohio. 

30.  Subhash  C.  Basak  and  Brian  D.  Gute  gave  an  invited  presentation  "Quantitative 
Molecular  Similarity  Analysis  (QMSA)  and  Toxicity  Prediction"  at  the  US  Air 
Force  Conference  "Chemistry  and  Toxicology  of  Candidate  Deicers"  organized 
by  the  Materials  Directorate  of  Wright  Patterson  Air  Force  Base  (WPAFB), 
Dayton,  OH. 

31. Brian D. Gute, Subhash C. Basak and Greg D. Grunwald presented a paper
“Development of QSARs of bioactive molecules using a hierarchical approach” at
the 31st Midwest Regional meeting of the American Chemical Society, November
6-8, 1996.

32.  Subhash  C.  Basak  presented  a  seminar  “QSAR/QMSA  using  nonempirical 
parameters:  applications  in  predictive  toxicology  and  drug  discovery”  at  the 
Abbott  Laboratories,  Chicago,  September  22-23,  1996. 

33.  Subhash  C.  Basak  and  Brian  Gute  presented  an  invited  lecture  at  the 
international symposium organized for the 1995 Herman Skolnik award in
chemical  information.  The  symposium  was  held  during  the  American  Chemical 
Society  meeting,  Orlando,  Florida,  August  25-29,  1996. 


Consultative and Advisory Functions
None 


Transitions 

1. Applied computational methods in the design of a set of six anti-epileptic
carbamates  by  Professor  Alexandru  T.  Balaban,  Vice  President,  Rumanian 
Academy  of  Sciences. 

2.  Worked  with  Dr.  James  Riviere,  North  Carolina  State  University,  in  the  clustering 
of  JP-8  components  using  dissimilarity  methods  developed  at  NRRI. 

3.  Worked  with  Professor  George  Mushrush,  Department  of  Chemistry,  George 
Mason  University,  Washington  D.C.,  in  the  application  of  similarity  and  QSAR 
methods  in  the  design  of  novel  and  benign  deicing  agents. 


New Discoveries

1. An in-depth study of similarity space construction and analog selection resulted
in  the  discovery  that  for  a  particular  set  of  compounds  the  degree  of  overlap 
between  the  groups  of  analogs  selected  by  theoretical  descriptor  spaces  is 
relatively  high.  This  study  also  revealed  that  a  similarity  space  constructed  from 
physicochemical  property  data  provided  relatively  unique  sets  of  analogs  as 
compared  to  those  selected  from  the  theoretically-derived  similarity  spaces. 

2. Hierarchical QSAR research using topostructural, topochemical, and geometrical
parameters  showed  that  the  first  two  classes  of  parameters  explain  most  of  the 
variance  in  the  data  of  toxicological  and  physicochemical  properties. 

3.  It  was  observed  that  similarity  spaces  derived  from  topostructural  and 
topochemical  parameters  have  distinct  analog  selection  characteristics. 


Honors/Awards 

1. Dr. S.C. Basak chaired a session at the DIMACS Workshop on Discrete
Mathematical Chemistry, March 23-25, 1998, held at Rutgers University, New
Jersey.

2. Dr. S.C. Basak was the Co-Chairperson of the First Indo/US Workshop on
Mathematical Chemistry, organized jointly by NRRI and Visva Bharati University,
Santiniketan,  West  Bengal  India,  Jan  9-13,  1998. 

3.  Subhash  C.  Basak  was  invited  to  present  a  lecture  on  molecular  similarity  at  the 
7th International Conference on Mathematical Chemistry and 3rd Girona Seminar
on Molecular Similarity, Girona, Spain, May 26-31, 1997.

4.  Subhash  C.  Basak  was  invited  to  become  a  member  of  the  Organizing  and 
Scientific Committee for future meetings of the International Conference on
Mathematical  and  Computer  Modelling  and  Scientific  Computing. 

5. Subhash C. Basak chaired and organized two sessions at the 11th International
Conference on Mathematical and Computer Modelling and Scientific Computing, March 31-
April 3, 1997, Georgetown University, Washington, DC.

6.  Subhash  C.  Basak  was  invited  to  become  one  of  six  invited  speakers  at  the 
international symposium organized for the 1995 Herman Skolnik award in
chemical information. The symposium was held during the American Chemical
Society meeting, Orlando, Florida, August 25-29, 1996, to honor Milan Randic,
the recipient of the 1995 Herman Skolnik award.


Appendix 1.1 Use of statistical and neural net approaches
in predicting toxicity of chemicals


Journal  of  Chemical  Information  and 
Computer  Sciences,  submitted 


Use of Statistical and Neural Net Approaches in Predicting Toxicity of Chemicals

Subhash C. Basak, Gregory D. Grunwald, Brian D. Gute, Krishnan Balasubramanian¹ and David Opitz²
Natural Resources Research Institute, University of Minnesota Duluth, Duluth, Minnesota 55811
¹Department of Chemistry and Biochemistry, Arizona State University, Tempe, Arizona 85287-1604
²Department of Computer Science, University of Montana, Missoula, Montana 59812


We  have  been  involved  in  the  development  of  a  new  hierarchical  quantitative  structure- 
activity  relationship  (H-QSAR)  approach  in  predicting  physicochemical,  biomedicinal  and 
toxicological  properties  of  various  sets  of  chemicals.  This  approach  uses  increasingly 
more  complex  molecular  descriptors  for  model  building  in  a  graduated  manner. 

In  this  paper  we  will  apply  statistical  and  neural  net  methods  in  the  development  of 
QSAR  models  for  predicting  the  aquatic  toxicity  of  sixty-nine  benzene  derivatives  using 
topostructural, topochemical, geometrical, and quantum chemical indices. The utility and
limitations  of  the  approach  will  be  discussed. 


1. INTRODUCTION

An  important  aspect  of  modern 
toxicological  research  is  the  prediction  of 
toxicity  of  xenobiotics  and  environmental 
pollutants from their structure. The
potential  toxicity  of  chemicals  is  usually 
assessed  from  a  plethora  of  relevant 
physical  and  biological  properties.  Table  1 
provides  a  partial  list  of  such  properties. 
Such  toxicological  indicators  usually  try  to 
predict  complex  toxicity  endpoints  of 
chemicals  to  humans  and  the  environment 
using  simpler  and  relevant  properties.  A 
perusal  of  the  combinatorics  of  the  situation 
shows  that  the  problem  is  astronomical.  The 
Toxic  Substances  Control  Act  (TSCA) 
Inventory  currently  has  about  80,000 
structures  most  of  which  do  not  have  data 
for  the  toxicologically  relevant  properties 
mentioned  in  Table  1.  In  fact,  about  50%  of 
these  chemicals  do  not  have  any 
experimental  property  data  at  all. 
Worldwide, more than 16 million chemicals
are known, as is evident from the number of
entries in the Chemical Abstracts Service
(CAS) inventory. For most of these
chemicals  we  do  not  have  the  data 
necessary  for  risk  assessment.  Modern 
combinatorial  chemistry  has  been  producing 
large  libraries  of  chemicals  at  a  very  rapid 
rate.  Most  of  these  substances  have  none 
of  the  test  data  needed  for  their  hazard 
estimation. 

In  recent  years,  there  have  been  efforts 
by  the  chemical  industry  and  government 
agencies  to  develop  reliable  databases  of 
properties  that  might  be  used  for  hazard 
estimation. This effort, although
commendable,  falls  short  of  the  need;  and 
the  picture  will  remain  so  in  the  foreseeable 
future.  In  the  area  of  molecular  biology, 
innovative  techniques  are  emerging  where 
specially  engineered  cell  lines  can  be  used 
to  detect  the  activity  or  toxicity  of  chemicals 
to the genetic system. Effects of
chemicals on the pattern of cellular proteins,
analyzed by proteomics technology, are
being used to detect their potential toxic
effects. Such methods are faster than
the traditional methods and can save a large
number of test animals. At present, neither
the available test data nor the combination
of in vitro toxicity testing methods provides
adequate resources for hazard assessment.

Quantitative  structure-activity/toxicity 
relationship  (QSAR/QSTR)  models  have 
emerged  as  useful  tools  to  handle  the  data 
gap  in  toxicology  and  pharmacology. 

Such  models  can  be  used  to  estimate 
complex  properties  of  chemicals  from 
simpler  experimental  or  computed 
properties.  In  view  of  the  fact  that  most 
chemicals in commerce and environmental
pollutants  have  very  little  test  data,  it  would 
be  desirable  if  we  could  develop 
toxicologically-relevant  QSARs  from 
properties  that  can  be  calculated  directly 
from  a  chemical's  molecular  structure.  In 
some  of  our  recent  papers  we  have 
developed  a  novel  hierarchical  QSAR 
approach  where  four  classes  of  theoretical 
molecular  descriptors,  viz.,  topostructural, 
topochemical,  geometrical,  and  quantum 
chemical  parameters,  have  been  used 
sequentially  in  the  formulation  of  QSAR 
models  for  predicting  physical,  biomedicinal, 
and  toxicological  properties. 

Most  of  our  hierarchical  QSARs  are 
based  on  linear  statistical  methods  such  as 
multiple  linear  regression,  principal 
components  analysis  (PCA)  and  variable 
clustering.  Such  methods  yield  useful 
models;  but  they  suffer  from  the  limitation 
that  in  some  cases  the  relationship  between 
a  molecular  descriptor  and  toxicity  may  be 
intrinsically  nonlinear.  In  such  cases,  the 
use  of  linear  statistical  methods  may  not 
result  in  the  best  models.  Therefore,  in  this 
paper,  we  have  carried  out  a  comparative 
study  of  multiple  regression  vis-a-vis  neural 
net  methods  in  predicting  toxicity  (LC50)  of  a 
set  of  69  benzene  derivatives. 

2.  METHODS 

2.1  Toxicity  Database.  The  utility  of  this 
approach  of  generating  numerous 
hierarchical  theoretical  descriptors  of 
compounds was tested on a set of acute
aquatic toxicity (LC50) data for sixty-nine
benzene derivatives. The data were taken
from a study by Hall, Kier and Phipps, who
collected acute aquatic toxicity data
measured in fathead minnow (Pimephales
promelas). These data were compiled from
eight  other  literature  sources  and  included 
some  original  work  which  was  conducted  at 
the  U.S.  Environmental  Protection  Agency 
Environmental  Research  Laboratory 
(USEPA  -  ERL)  in  Duluth,  Minnesota.  This 
set  of  chemicals  was  composed  of  benzene 
and  sixty-eight  substituted  benzene 
derivatives.  According  to  the  authors,  these 
benzene  derivatives  were  tested  using 
methodologies  comparable  to  their  own  96- 
hour  fathead  minnow  toxicity  test  system. 
The  derivatives  chosen  for  this  study  (see 
Table  2)  have  seven  different  substituent 
groups  that  are  present  in  at  least  six  of  the 
molecules:  chloro-,  bromo-,  nitro-,  methyl-, 
methoxyl-,  hydroxyl-,  and  amino-. 

2.2  Calculation  of  Topological  Indices 
(TIs).  The  complete  set  of  topological 
indices  (TIs)  used  in  this  study,  both 
topostructural  and  topochemical,  have  been 
calculated using POLLY 2.3 and software
developed by Basak et al. These
indices include the Wiener index, the
connectivity indices developed by Randic
and higher order connectivity indices
formulated by Kier and Hall, bonding
connectivity indices defined by Basak et
al., a set of information theoretic indices
defined on the distance matrices of simple
molecular graphs, a set of parameters
derived on the neighborhood complexity of
hydrogen-filled molecular graphs, and
Balaban's J indices. Table 3 provides
the  symbols  and  brief  definitions  of  the 
topological  indices  included  in  this  study. 

The  set  of  TIs  was  divided  into  two 
distinct subsets: topostructural indices (TSI)
and topochemical indices (TCI). TSIs are
topological  indices  which  encode 
information  about  the  adjacency  and 
distances  of  atoms  (vertices)  in  molecular 
structures  (graphs)  irrespective  of  the 
chemical  nature  of  the  atoms  involved  in  the 
bonding  or  factors  such  as  hybridization 
states of atoms, number of core/valence
electrons in individual atoms, etc. TCIs are
parameters that quantify information
regarding the topology (connectivity of
atoms), as well as specific chemical
properties  of  the  atoms  comprising  a 
molecule.  TCIs  are  derived  from  weighted 
molecular  graphs  where  each  vertex  (atom) 
is  properly  weighted  with  relevant 
chemical/physical  properties.  Table  3  shows 
the  division  of  the  topological  indices  into 
topostructural  and  topochemical  indices. 

2.3  Calculation  of  Geometrical 
Indices.  The  geometrical  indices  include 
three-dimensional  Wiener  numbers  for 
hydrogen-filled  molecular  structure, 
hydrogen-suppressed  molecular  structure, 
and van der Waals volume. Van der Waals
volume, Vw, was calculated using SYBYL
6.4 from Tripos Associates, Inc. The 3-D
Wiener numbers were calculated in
SYBYL using an SPL (Sybyl Programming
Language) program developed in our lab.
The 3-D Wiener number is the sum of the
entries in the upper triangular
submatrix of the topographic Euclidean
distance matrix for a molecule. The 3-D
coordinates for the atoms were determined
using CONCORD 3.2.1. The symbols and
definitions  of  the  geometrical  indices  are 
included  in  Table  3. 
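
The definition of the 3-D Wiener number given above can be illustrated with a minimal Python sketch. It is not the SPL program used in the study; the function name `wiener_3d` and the three-atom coordinates are hypothetical, chosen only to make the arithmetic easy to check by hand.

```python
import numpy as np

def wiener_3d(coords):
    """Sum the upper-triangular entries of the Euclidean distance matrix."""
    xyz = np.asarray(coords, dtype=float)
    # Pairwise coordinate differences, shape (n_atoms, n_atoms, 3).
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Each atom pair is counted once: strictly above the diagonal.
    upper = np.triu_indices(len(xyz), k=1)
    return float(dist[upper].sum())

# Hypothetical coordinates: three collinear atoms spaced 1.0 apart.
# Pair distances are 1, 1 and 2.
example = wiener_3d([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
```

Passing coordinates that include hydrogens, or only the heavy atoms, would correspond to the hydrogen-filled and hydrogen-suppressed variants respectively.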

2.4 Quantum Chemical Parameters.
Quantum chemical parameters were
calculated using the Austin Model version
one (AM1) semi-empirical Hamiltonian.
These parameters were calculated using
MOPAC 6.00 in the SYBYL interface. Brief
definitions  and  symbols  for  the  quantum 
chemical  parameters  used  in  this  study  are 
included  in  Table  3. 

2.5  Statistical  Analysis  and 
Hierarchical  QSAR.  Initially,  all  topological 
indices  were  transformed  by  the  natural 
logarithm  of  the  index  plus  one.  This  was 
done  to  scale  the  indices,  since  some  may 
be  several  orders  of  magnitude  greater  than 
others,  while  other  indices  may  equal  zero. 
The geometric indices were transformed by
the natural logarithm of the index for
consistency; the addition of one was
unnecessary.
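
The two scaling transformations described above can be sketched as follows. The function names and the sample values are illustrative assumptions, not taken from the study.

```python
import numpy as np

def scale_topological(values):
    """ln(index + 1): defined even when an index equals zero."""
    return np.log1p(np.asarray(values, dtype=float))

def scale_geometrical(values):
    """ln(index): assumes every geometrical index is strictly positive."""
    return np.log(np.asarray(values, dtype=float))

# Hypothetical index values spanning several orders of magnitude.
topo_scaled = scale_topological([0.0, 9.0, 999.0])
geom_scaled = scale_geometrical([1.0, 100.0])
```

After scaling, the hypothetical topological values 0, 9 and 999 become 0, ln 10 and ln 1000, so a three-order-of-magnitude spread is compressed to a range of about seven units.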


The  set  of  eighty-six  topological  indices 
was then partitioned into two distinct
sets: topostructural indices (thirty-five) and
topochemical indices (fifty-one). The sets of
topostructural and topochemical indices
were then divided into subsets, or clusters,
based on the correlation matrix using the
SAS variable clustering procedure
(VARCLUS) to further reduce the number
of  independent  variables  for  use  in  model 
construction.  This  procedure  divides  the  set 
of  indices  into  disjoint  clusters,  such  that 
each  cluster  is  essentially  unidimensional. 

From  each  cluster,  the  index  most 
correlated  with  the  cluster  was  selected  for 
modeling,  as  well  as  any  indices  that  were 
poorly correlated with their cluster (R² <
0.70). These indices were then used in the
modeling  of  the  acute  aquatic  toxicity  of 
benzene  derivatives  in  fathead  minnow.  The 
variable  clustering  and  selection  of  indices 
was  performed  independently  for  both  the 
topostructural  and  topochemical  indices. 
This  procedure  resulted  in  a  set  of  five 
topostructural  indices  and  a  set  of  nine 
topochemical  indices. 
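
The selection rule applied to each cluster can be sketched in Python. This is not a reimplementation of VARCLUS: cluster membership is assumed to be given, the cluster component is approximated by the first principal component of the standardized indices, and the 0.70 cutoff is read as a squared correlation. The function name and the toy data are hypothetical.

```python
import numpy as np

def select_from_cluster(X, names, threshold=0.70):
    """Keep the index best correlated with the cluster component, plus any
    index whose squared correlation with it falls below the threshold."""
    X = np.asarray(X, dtype=float)
    # Standardize columns (assumes no index is constant within the cluster).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # First principal component stands in for the cluster component.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    component = Z @ vt[0]
    r2 = np.array([np.corrcoef(Z[:, j], component)[0, 1] ** 2
                   for j in range(Z.shape[1])])
    keep = {int(np.argmax(r2))} | {j for j in range(len(r2)) if r2[j] < threshold}
    return [names[j] for j in sorted(keep)], dict(zip(names, r2))

# Toy cluster: "a" and "b" are nearly identical, "c" is almost unrelated.
base = np.arange(10.0)
alt = np.array([1.0, -1.0] * 5)
c = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0])
selected, r2_by_name = select_from_cluster(
    np.column_stack([base, base + 0.1 * alt, c]), ["a", "b", "c"])
```

In the toy cluster one of the two redundant indices is retained as the representative, and the weakly correlated index is retained as well because it carries information the cluster component does not capture.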

Reducing  the  number  of  independent 
variables  is  critical  when  attempting  to 
model  small  datasets  using  linear  statistical 
methods.  The  smaller  the  dataset,  the 
greater  the  chance  of  spurious  error  when 
using  a  large  number  of  independent 
variables (descriptors). Topliss and
Edwards have thoroughly studied this
issue  of  chance  correlations.  For  a  set  with 
about  seventy  dependent  variables 
(observations),  to  keep  the  probability  of 
chance  correlations  less  than  0.01,  at  most 
forty  independent  variables  may  be  used. 
This  number  is  dependent  on  the  actual 
correlation achieved in the modeling
process; higher correlation results in a
better  chance  of  using  more  variables  with 
the  same  limited  probability  of  chance 
correlations.  In  this  study  we  are  well  below 
the  cut-off  of  forty  independent  variables.  In 
fact,  the  total  number  of  descriptors  which 
will  be  used  for  model  construction  and 
estimation  is  twenty-three,  well  within  the 
bounds of the Topliss and Edwards
criteria.

Regression modeling was accomplished
using the SAS procedure REG on four
distinct  sets  of  indices.  These  sets  were 
constructed  as  part  of  a  hierarchical 
approach  to  QSAR  model  development. 
The  hierarchy  begins  with  the  simplest 
parameters, the TSIs. After using the TSIs
to model the activity, the next level of
parameters of higher complexity is added.
To  the  indices  included  in  the  best  TSI 
model,  we  add  all  of  the  TCIs  and  proceed 
to  model  the  activity  using  these 
parameters.  Likewise,  the  indices  included 
in  the  best  model  from  this  procedure  are 
combined  with  the  indices  from  the  next 
complexity  level,  the  geometrical  indices 
and  modeling  is  conducted  once  again. 
Finally,  the  best  model  utilizing  TSIs,  TCIs 
and  geometrical  indices  is  combined  with 
the  quantum  chemical  parameters  to 
develop  the  final  model  in  the  hierarchy. 
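
The tiered procedure described above can be sketched with an exhaustive (all-subsets) least-squares search. The study used SAS PROC REG; the criterion for choosing the best subset is not stated in the text, so adjusted R² is used here purely as an assumption. The function names, the `max_size` limit and the toy descriptors are hypothetical.

```python
from itertools import combinations
import numpy as np

def r_squared(X, y):
    """Explained variance of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def best_subset(data, y, candidates, max_size):
    """Try every subset up to max_size; return the best by adjusted R^2
    (an assumed criterion)."""
    n = len(y)
    best, best_score = (), -np.inf
    for k in range(1, max_size + 1):
        for subset in combinations(candidates, k):
            X = np.column_stack([data[name] for name in subset])
            adj = 1.0 - (1.0 - r_squared(X, y)) * (n - 1) / (n - k - 1)
            if adj > best_score:
                best, best_score = subset, adj
    return list(best), best_score

def hierarchical_qsar(data, y, tiers, max_size=4):
    """At each tier, combine the indices retained so far with the new class
    of descriptors and repeat the subset search."""
    retained, history = [], []
    for tier in tiers:
        candidates = retained + [name for name in tier if name not in retained]
        retained, score = best_subset(data, y, candidates, max_size)
        history.append((retained, score))
    return history

# Toy data: the response depends on one descriptor from each of two tiers.
t = np.arange(12.0)
toy = {"t1": t, "t2": np.sin(t), "c1": np.cos(3.0 * t)}
response = 2.0 * toy["t1"] + 3.0 * toy["c1"]
history = hierarchical_qsar(toy, response, [["t1", "t2"], ["c1"]], max_size=2)
```

Each entry of `history` holds the indices retained at that tier and the score of the corresponding model, mirroring the way each tier's best model seeds the next level of the hierarchy.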

2.6  Neural  Network  Methods.  Using 
neural  networks,  we  studied  two  classes  of 
approaches  for  modeling  toxicity:  (1)  giving 
all  the  descriptors  to  a  learning  algorithm 
(neural  network  in  this  case),  and  (2) 
reducing  the  feature  set  before  giving  the 
(reduced)  feature  set  to  a  learning 
algorithm.  Results  for  our  approaches  are 
from leave-one-out experiments (i.e., sixty-nine
training/test set partitions). Leave-one-out
works by leaving one data point out of
the  training  set  and  giving  the  remaining 
instances  (sixty-eight  in  this  case)  to  the 
learning  algorithms  for  training.  This 
process  is  repeated  sixty-nine  times  so  that 
each  example  is  a  part  of  the  test  set  once 
and  only  once.  Leave-one-out  tests 
generalization  accuracy  of  a  learner, 
whereas  training  set  accuracy  tests  only  the 
learner's  ability  to  memorize.  Generalization 
error  from  the  test  set  is  the  true  test  of 
accuracy  and  is  what  we  report  here. 
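
A minimal sketch of the leave-one-out procedure follows. The learner is passed in as a pair of functions so that the same loop could serve a regression model or a neural network; the linear learner and the ten-point data set shown here are illustrative assumptions only.

```python
import numpy as np

def leave_one_out_predictions(X, y, fit, predict):
    """Hold out each observation once; train on the rest; predict the held-out one."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        train = np.arange(len(y)) != i        # every observation except i
        model = fit(X[train], y[train])
        preds[i] = predict(model, X[i:i + 1])[0]
    return preds

def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Hypothetical data: an exactly linear response, so held-out predictions
# should reproduce the observed values.
X_demo = np.arange(10.0).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0
loo_preds = leave_one_out_predictions(X_demo, y_demo, fit_linear, predict_linear)
# Squared correlation between held-out predictions and observations.
loo_r2 = np.corrcoef(loo_preds, y_demo)[0, 1] ** 2
```

With sixty-nine compounds the loop would run sixty-nine times, and statistics computed from `loo_preds` reflect generalization rather than fit to the training set.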

First  we  trained  neural  networks  using 
all  ninety-five  parameters:  thirty-five  TSI, 
fifty-one  TCI,  three  geometrical  and  six 
quantum  chemical  parameters.  The 
networks  contained  fifteen  hidden  units  and 
were  trained  for  1000  epochs.  Each  input 
parameter  was  normalized  to  a  value 
between 0 and 1 before training. Additional
parameter settings for the neural networks
included  a  learning  rate  of  0.05,  a 
momentum  term  of  0.1  and  weights 
initialized  randomly  between  -0.25  and 
0.25. 
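
The network settings listed above can be sketched in plain NumPy. The text does not state the activation function, the loss, or whether weights were updated per example or per batch; the sketch assumes sigmoid hidden units, a linear output, squared error, and per-example updates. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each input column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)    # guard against constant columns
    return (X - lo) / span

def train_network(X, y, hidden=15, epochs=1000, lr=0.05, momentum=0.1, seed=0):
    """One hidden layer trained by backpropagation with momentum."""
    rng = np.random.default_rng(seed)
    # Weights initialized uniformly in [-0.25, 0.25], as stated in the text.
    W1 = rng.uniform(-0.25, 0.25, (X.shape[1], hidden))
    b1 = rng.uniform(-0.25, 0.25, hidden)
    W2 = rng.uniform(-0.25, 0.25, hidden)
    b2 = rng.uniform(-0.25, 0.25)
    vel = [np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), 0.0]
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # assumed: per-example updates
            h = 1.0 / (1.0 + np.exp(-(X[i] @ W1 + b1)))
            err = (h @ W2 + b2) - y[i]        # derivative of 0.5 * err**2
            grad_h = err * W2 * h * (1.0 - h)
            grads = [np.outer(X[i], grad_h), grad_h, err * h, err]
            vel = [momentum * v - lr * g for v, g in zip(vel, grads)]
            W1 += vel[0]
            b1 += vel[1]
            W2 += vel[2]
            b2 += vel[3]
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return (1.0 / (1.0 + np.exp(-(X @ W1 + b1)))) @ W2 + b2

# Synthetic data, used only to check that training reduces the error.
rng = np.random.default_rng(1)
X_demo = min_max_scale(rng.uniform(0.0, 10.0, (30, 3)))
y_demo = X_demo[:, 0] + 2.0 * X_demo[:, 1]
untrained = train_network(X_demo, y_demo, epochs=0)
trained = train_network(X_demo, y_demo, epochs=200)
mse_before = float(np.mean((predict(untrained, X_demo) - y_demo) ** 2))
mse_after = float(np.mean((predict(trained, X_demo) - y_demo) ** 2))
```

The defaults follow the reported settings (fifteen hidden units, 1000 epochs, learning rate 0.05, momentum 0.1); the demonstration uses fewer epochs only to keep the run short.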

For  our  next  experiment,  we  used  a 
smaller  set  of  twenty-three  independent 
variables.  The  twenty-three  independent 
variables  were  the  topostructural  and 
topochemical  parameters  provided  by  the 
variable clustering technique (see section
3.1 for a list of the indices) combined with
the  three  geometrical  and  six  quantum 
chemical  parameters  described  in  Table  3. 
The  parameter  settings  for  these  networks 
were  the  same  as  the  settings  for  the  other 
neural  network  experiments  mentioned 
above. 

3.  RESULTS 

3.1  Results  of  Statistical  Regression 
Procedures.  The  variable  clustering  of  the 
topostructural indices resulted in the
retention of five indices: M1, IC, O, Ps, P9.
All-subsets regression resulted in the
selection of a four-parameter model to
estimate -log(LC50) with an explained
variance (R²) of 45.3% and a standard error
(s)  of  0.58.  While  this  is  an  unsatisfactory 
model,  the  indices  were  still  retained  and 
combined  with  the  topochemical  indices  in 
the  second  step  of  model  development.  The 
second  step  combined  the  four  indices  used 
in  the  first  tier  model  with  the  nine 
topochemical  indices  selected  in  the 
variable clustering procedure: SIC0, SIC1,
SIC4, CIC0, Y. Yc. Yc, Vpc. J’'. Again,
all-subsets regression was conducted
resulting in a four-parameter model with an
explained variance (R²) of 78.3% and a
standard  error  (s)  of  0.36.  The  four  indices 
from  the  second  tier  model  were  combined 
with the three geometric parameters: 3DW_H,
3DW, Vw. This resulted in a four-parameter
model that replaced the topochemical index
CIC0 with the geometric parameter 3DW_H.
This model had an explained variance (R²)
of  79.2%  and  a  standard  error  (s)  of  0.36. 
The  final  step  in  the  hierarchical  method 
combined  the  four  parameters  from  the  third 
tier model with the semi-empirical quantum
chemical parameters: E_HOMO, E_HOMO1, E_LUMO,
E_LUMO1, ΔH_f, μ. This set of ten indices led
to a seven-parameter model with an
explained variance (R²) of 86.3% and a
standard  error  (s)  of  0.30.  This  model 
retained  all  indices  from  the  third  model  and 
added  three  of  the  AM1  quantum  chemical 
parameters. 

The  leave-one-out  analysis  was 
conducted  on  the  final  model  for  purposes 
of  comparison  with  the  results  of  the  neural 
networks.  This  analysis  resulted  in  a  final 
explained variance for the model of R² =
0.825  and  a  standard  error  of  s  =  0.32. 

3.2 Results of the Neural Network
Procedures. The first class of approach,
incorporating all ninety-five parameters,
obtained a test-set correlation coefficient
between predicted toxicity and measured
toxicity (explained variance) of R² = 0.868
and  a  standard  error  of  0.29.  The  second 
class  of  neural  network  approaches  utilizing 
the  twenty-three  parameters  from  the  data 
reduction  step  obtained  a  test-set  explained 
variance  (R^)  of  0.878  and  a  standard  error 
(s)  of  0.28.  The  results  from  the  leave-one- 
out  analysis  using  the  linear  statistical 
method  and  the  neural  network  methods  are 
summarized  in  Table  4.  Table  2  presents 
the  experimental  acute  aquatic  toxicity 
(-log[LC50]) values for the sixty-nine
benzene  derivatives  as  well  as  the  values 
estimated  by  the  best  statistical  model  and 
the  two  neural  network  models. 

4.  DISCUSSION 

The  results  show  that  both  statistical  and 
neural  network  models  give  acceptable 
estimates  of  the  toxicity  of  the  sixty-nine 
benzene  derivatives  studied  in  this  paper. 
However,  when  tested  using  the  leave-one- 
out  approach,  the  statistical  model  falls 
short of the performance of the neural
network  models.  It  has  to  be  noted, 
however,  that  statistical  QSARs  are  based 
on  linear  models  whereas  the  two  neural 
network  models  use  nonlinear  methods. 

It  is  interesting  to  note  that  the  neural 
network model using the subset of twenty-three
inputs selected in part by the
VARCLUS  procedure  gave  slightly  better 
results  as  compared  to  the  network 
developed  using  all  ninety-five  input 
variables.  This  could  be  the  result  of  filtering 
out  redundant,  or  nearly  redundant, 
parameters  from  the  set  of  independent 
variables. 

Further  work  on  the  relative  utility  of 
statistical  vis-a-vis  neural  network  methods 
is  necessary  to  determine  which  types  of 
models  are  best  suited  to  the  estimation  of 
chemical  toxicity. 


Table 1. Physicochemical and biological properties relevant to the assessment of
toxicity.

Physicochemical                   Biological
Molar Volume                      Receptor Binding (Kd)
Boiling Point                     Michaelis Constant (Km)
Melting Point                     Inhibitor Constant (Ki)
Vapor Pressure                    Biodegradation
Aqueous Solubility                Bioconcentration
Dissociation Constant (pKa)       Alkylation Profile
Partition Coefficient             Metabolic Profile
  Octanol-Water (log P)           Chronic Toxicity
  Air-Water                       Carcinogenicity
  Sediment-Water                  Mutagenicity
Reactivity (Electrophile)         Acute Toxicity
                                    LD50
                                    LC50

Compound                        Exp.    LR      NN95    NN23
------------------------------------------------------------
Benzene                         3.40    3.42    3.66    3.65
Bromobenzene                    3.89    3.77    4.02    3.79
Chlorobenzene                   3.77    3.75    3.80    3.77
Phenol                          3.51    3.38    3.44    3.51
Toluene                         3.32    3.66    3.50    3.62
1,2-dichlorobenzene             4.40    4.29    4.24    4.30
1,3-dichlorobenzene             4.30    4.37    4.03    4.12
1,4-dichlorobenzene             4.62    4.51    4.46    4.27
2-chlorophenol                  4.02    3.79    3.82    3.91
3-chlorotoluene                 3.84    3.88    3.72    3.79
4-chlorotoluene                 4.33    3.87    3.78    3.76
1,3-dihydroxybenzene            3.04    3.43    3.47    3.53
3-hydroxyanisole                3.21    3.33    3.40    3.45
2-methylphenol                  3.77    3.64    3.55    3.67
3-methylphenol                  3.29    3.60    3.51    3.58
4-methylphenol                  3.58    3.53    3.54    3.55
4-nitrophenol                   3.36    3.61    3.65    3.76
1,4-dimethoxybenzene            3.07    3.28    3.79    3.51
1,2-dimethylbenzene             3.48    3.93    3.88    3.91
1,4-dimethylbenzene             4.21    3.87    3.74    3.68
2-nitrotoluene                  3.57    3.66    3.78    3.81
3-nitrotoluene                  3.63    3.53    3.71    3.71
4-nitrotoluene                  3.76    3.49    3.68    3.68
1,2-dinitrobenzene              5.45    5.24    4.91    4.99
1,3-dinitrobenzene              4.38    4.18    4.30    4.19
1,4-dinitrobenzene              5.22    4.94    4.38    4.85
2-methyl-3-nitroaniline         3.48    3.79    3.79    3.88
2-methyl-4-nitroaniline         3.24    3.51    3.79    3.75
2-methyl-5-nitroaniline         3.35    3.68    3.82    3.86
2-methyl-6-nitroaniline         3.80    3.84    3.73    3.79
3-methyl-6-nitroaniline         3.80    3.78    3.64    3.62
4-methyl-2-nitroaniline         3.79    3.80    3.73    3.66
4-hydroxy-3-nitroaniline        3.65    3.61    3.53    3.58
4-methyl-3-nitroaniline         3.77    3.73    3.72    3.72
1,2,3-trichlorobenzene          4.89    4.89    4.85    5.04
1,2,4-trichlorobenzene          5.00    5.04    5.05    4.83
1,3,5-trichlorobenzene          4.74    5.11    4.62    4.78
2,4-dichlorophenol              4.30    4.33    4.42    4.47
3,4-dichlorotoluene             4.74    4.26    4.39    4.28
2,4-dichlorotoluene             4.54    4.36    4.47    4.44
4-chloro-3-methylphenol         4.27    3.87    3.96    4.07
2,4-dimethylphenol              3.86    3.76    3.78    3.72
2,6-dimethylphenol              3.75    3.80    3.71    3.84
3,4-dimethylphenol              3.90    3.80    3.92    3.79
2,4-dinitrophenol               4.04    4.14    4.15    4.01
1,2,4-trimethylbenzene          4.21    4.09    4.53    3.87
2,3-dinitrotoluene              5.01    5.20    5.12    5.28
2,4-dinitrotoluene              3.75    4.10    4.65    4.33
2,5-dinitrotoluene              5.15    4.84    4.71    4.72
2,6-dinitrotoluene              3.99    4.41    4.56    4.63
3,4-dinitrotoluene              5.08    5.11    5.11    5.09
3,5-dinitrotoluene              3.91    4.05    4.41    4.16
1,3,5-trinitrobenzene           5.29    5.37    5.34    5.32
2-methyl-3,5-dinitroaniline     4.12    4.13    4.30    4.23
2-methyl-3,6-dinitroaniline     5.34    4.80    4.40    4.54
3-methyl-2,4-dinitroaniline     4.26    4.28    4.14    4.20
5-methyl-2,4-dinitroaniline     4.92    4.14    4.00    4.02
4-methyl-2,6-dinitroaniline     4.21    4.67    4.57    4.58
5-methyl-2,6-dinitroaniline     4.18    4.80    4.53    4.78
4-methyl-3,5-dinitroaniline     4.46    4.34    4.32    4.43
2,4,6-tribromophenol            4.70    4.89    5.34    5.47
1,2,3,4-tetrachlorobenzene      5.43    5.62    5.50    5.56
1,2,4,5-tetrachlorobenzene      5.85    5.80    5.63    5.61
2,4,6-trichlorophenol           4.33    4.79    4.86    4.96
2-methyl-4,6-dinitrophenol      5.00    4.21    4.20    4.16
2,3,6-trinitrotoluene           6.37    6.36    5.84    5.81
2,4,6-trinitrotoluene           4.88    5.16    5.39    5.42
2,3,4,5-tetrachlorophenol       5.72    5.36    5.44    5.58
2,3,4,5,6-pentachlorophenol     6.06    6.03    5.86    5.83
Table 3. Symbols, definitions and classifications of topological, geometrical and
quantum chemical parameters.

Topostructural

$I_D^W$            Information index for the magnitudes of distances between all
                   possible pairs of vertices of a graph
$\bar{I}_D^W$      Mean information index for the magnitude of distance
$W$                Wiener index = half-sum of the off-diagonal elements of the
                   distance matrix of a graph
$I^D$              Degree complexity
$H^V$              Graph vertex complexity
$H^D$              Graph distance complexity
$\overline{IC}$    Information content of the distance matrix partitioned by
                   frequency of occurrences of distance h
$O$                Order of neighborhood when $IC_r$ reaches its maximum value for
                   the hydrogen-filled graph
$M_1$              A Zagreb group parameter = sum of square of degree over all
                   vertices
$M_2$              A Zagreb group parameter = sum of cross-product of degrees over
                   all neighboring (connected) vertices
$^h\chi$           Path connectivity index of order h = 0-6
$^h\chi_C$         Cluster connectivity index of order h = 3, 5
$^h\chi_{Ch}$      Chain connectivity index of order h = 6
$^h\chi_{PC}$      Path-cluster connectivity index of order h = 4-6
$P_h$              Number of paths of length h = 0-10
$J$                Balaban's J index based on distance

Topochemical

$I_{ORB}$          Information content or complexity of the hydrogen-suppressed
                   graph at its maximum neighborhood of vertices
$IC_r$             Mean information content or complexity of a graph based on the
                   (r = 0-6) order neighborhood of vertices in a hydrogen-filled
                   graph
$SIC_r$            Structural information content for (r = 0-6) order neighborhood
                   of vertices in a hydrogen-filled graph
$CIC_r$            Complementary information content for (r = 0-6) order
                   neighborhood of vertices in a hydrogen-filled graph
$^h\chi^b$         Bond path connectivity index of order h = 0-6
$^h\chi^b_C$       Bond cluster connectivity index of order h = 3, 5
$^h\chi^b_{Ch}$    Bond chain connectivity index of order h = 6
$^h\chi^b_{PC}$    Bond path-cluster connectivity index of order h = 4-6
$^h\chi^v$         Valence path connectivity index of order h = 0-6
$^h\chi^v_C$       Valence cluster connectivity index of order h = 3, 5
$^h\chi^v_{Ch}$    Valence chain connectivity index of order h = 6
$^h\chi^v_{PC}$    Valence path-cluster connectivity index of order h = 4-6
$J^B$              Balaban's J index based on bond types
$J^X$              Balaban's J index based on relative electronegativities
$J^Y$              Balaban's J index based on relative covalent radii

Geometrical

$V_W$              van der Waals volume
$^{3D}W$           3-D Wiener number for the hydrogen-suppressed geometric
                   distance matrix
$^{3D}W_H$         3-D Wiener number for the hydrogen-filled geometric distance
                   matrix

Quantum Chemical

$E_{HOMO}$         Energy of the highest occupied molecular orbital
$E_{HOMO1}$        Energy of the second highest occupied molecular orbital
$E_{LUMO}$         Energy of the lowest unoccupied molecular orbital
$E_{LUMO1}$        Energy of the second lowest unoccupied molecular orbital
$\Delta H_f$       Heat of formation


Table 4. Relative effectiveness of statistical and neural network methods in estimating
the acute aquatic toxicity of 69 benzene derivatives.

Method            # Independent Variables              s
---------------------------------------------------------
Statistical                  7               0.825   0.32
Neural network              95               0.868   0.29
Neural network              23               0.878   0.28
Appendix 1.2 Hazard assessment modeling: An evolutionary ensemble approach


Hazard  Assessment  Modeling:  An  Evolutionary  Ensemble  Approach 


David  W.  Opitz 

Department  of  Computer  Science 
University  of  Montana 
Missoula,  MT  59812  (USA) 
opitz@cs.umt.edu 
406-243-2831 

Abstract 

This paper presents a novel and effective genetic algorithm approach for
generating computational models for hazard assessment. With millions of
proposed chemicals being registered each year, it is impossible to come
even remotely close to completing the battery of tests needed for the
proper understanding of the toxic effects of these chemicals. Computer
models can give quick, cheap, and environmentally friendly hazard
assessments of chemicals. Our approach works by first extracting a
hierarchy of theoretical descriptors of the structure of a compound, then
filtering these numerous descriptors with a genetic algorithm approach to
ensemble feature selection. We tested the utility of our approach by
modeling the acute aquatic toxicity (LC50) of a congeneric set of 69
benzene derivatives. Our results demonstrate a very important point: our
method is able to accurately predict toxicity directly from structure.

1  INTRODUCTION 

By the end of 1998 the number of chemicals registered with the Chemical
Abstracts Service rose to over 19 million (CAS 1999). This is an increase
of over 3 million chemicals between 1996 and 1998. It is desirable to test
each of these chemicals for its effects on the environment and human
health (which we refer to as hazard assessment); however, completing the
battery of tests necessary for the proper hazard assessment of even a
single compound is a costly and time-consuming process. There is simply
not enough time or money to complete these test batteries for even a tiny
portion of the compounds that are registered today (Menzel 1995). An
alternative to these traditional test batteries is to develop
computational models for hazard assessment. Computational models are fast
(milliseconds per compound), cheap (less than one cent per compound), and
do not run the risk of adversely affecting the environment during testing.
Additionally, these computational methods can replace or limit the amount
of animal testing that is necessary. Thus computational models can easily
process all registered chemicals and flag the ones that require further
testing. The central problem with this approach is developing
class-specific models that can be considered accurate enough to be useful.
In this paper, we present a novel and effective approach for learning
computational hazard assessment models by using an ensemble feature
selection algorithm based on genetic algorithms (GAs) to filter numerous
theoretical descriptors of chemical structure.

To better illustrate the need for effective and quick hazard assessment,
consider the situation of the industrial chemicals "grandfathered" into
continued use under the Toxic Substances Control Act (TSCA) of 1976. TSCA
has required that a suite of physicochemical and toxicological screens be
run on all commercial compounds (those produced or imported in volumes
exceeding one million pounds annually) developed after 1976. However,
there are almost 3,000 chemicals that were "grandfathered" in with the
understanding that it would be the responsibility of the chemical
manufacturing industry to ultimately supply information about these
chemicals. Only recently, after a 20-year delay, are the chemical
manufacturers talking about running 2,800 of these compounds through basic
toxicity screens. While this is promising, these screens will not be
completed until 2004, at a cost of between $500 and $700 million. So it
will be another five years before we have basic toxicity data on compounds
that have been in widespread use for more than twenty years (Johnson
1998).

One of the fundamental principles of biochemistry is that activity is
dictated by structure (Hansch 1976). Following this principle, one can use
theoretical molecular descriptors that quantify structural aspects of a
molecule to quantitatively determine its activity (Basak & Grunwald 1995;
Cramer, Famini, & Lowrey 1993). These theoretical descriptors can be
generated directly from the known structure of the molecule and used to
estimate its properties, without the need for further experimental data.
This is important because, for chemicals needing to be evaluated for
hazard assessment, there is a scarcity of the experimental data normally
required as inputs (i.e., independent variables) to traditional
quantitative structure-activity relationship (QSAR) model development. A
QSAR model based solely on theoretical descriptors, on the other hand, can
process all registered chemicals for hazard assessment.

Our hierarchical approach examines the relative contributions of
theoretical descriptors of gradually increasing complexity (structural,
chemical, shape, and quantum chemical descriptors). This approach is
important as none of the individual classes of parameters is very
effective at predicting toxicity (Gute & Basak 1997); however, we show in
this paper that we can effectively predict toxicity if we combine all
levels of descriptors. One potential problem with our hierarchical
approach is that it often yields many independent variables relative to
the number of data points, since small data sets are not uncommon in
hazard assessment. For instance, in our case study of predicting acute
toxicity (LC50) of benzene derivatives, we have 95 independent variables
and 69 data points. Therefore, reducing the number of independent
variables is critical when attempting to model small data sets. The
smaller the data set, the greater the chance of spurious error when using
a large number of independent variables (descriptors). In some of our
earlier QSAR studies we have used statistical methods such as principal
components analysis (PCA) and variable clustering methods to reduce the
number of independent variables (Basak & Grunwald 1995; Gute & Basak 1997;
Gute, Grunwald, & Basak In press).

As an alternative solution, we use our previous ensemble feature selection
approach (Opitz 1999) that is based on GAs. An "ensemble" is a combination
of the outputs from a set of models that are generated from separately
trained inductive learning algorithms. Ensembles have been shown, in most
cases, to greatly improve generalization accuracy over a single learning
model (Breiman 1996; Maclin & Opitz 1997; Schapire et al. 1997). Recent
research has shown that an effective ensemble should consist of a set of
models that are not only highly correct, but that also make their errors
on different parts of the input space (Hansen & Salamon 1990; Krogh &
Vedelsby 1995; Opitz & Shavlik 1996a). Varying the feature subsets used by
each member of the ensemble helps promote the necessary diversity and
create a more effective ensemble (Opitz 1999). We use GAs to search the
enormous space of candidate sets of feature subsets for one that will
promote disagreement among the component members of an ensemble while
still maintaining the component members' accuracy.

Combining our approach of generating hierarchical theoretical descriptors
with our other approach to GA-based ensemble feature selection, we are
able to generate an effective model for predicting the toxicity of benzene
derivatives using only a few compounds. Our results show that our model is
nearly as accurate as the battery of tests necessary for the proper hazard
assessment of a single compound. Our results also confirm that our new
ensemble feature selection approach is more effective than previous
approaches for modeling hazard assessment.

The rest of the paper is organized as follows. First we provide background
and related work for both our hierarchical QSAR approach and our GA-based
ensemble feature selection approach. This is followed by results of our
approach applied to benzene derivatives. Finally, we discuss these results
and provide future work.

2  QSAR  AND  THEORETICAL 
METHODS 

QSARs have come into widespread use for the prediction of various
molecular properties, as well as biological, pharmacological and
toxicological responses. Traditional QSAR techniques use empirical
properties (Dearden 1990; Hansch & Leo 1995; de Waterbeemd 1995); however,
due to the scarcity of available data for the majority of chemicals
needing to be evaluated for hazard assessment, the physicochemical
properties necessary for traditional QSAR model development may not be
available. When this is the case, it is imperative that there are methods
available which make use of nonempirical parameters, which we term
theoretical molecular descriptors.

Topological indices (TIs) are numerical graph invariants that quantify
certain aspects of molecular structure (Gute & Basak 1997; Gute, Grunwald,
& Basak In press). The different classes of TIs provide us with
nonempirical, quantitative descriptors that can be used in place of
experimentally derived descriptors in QSARs for the prediction of
properties.

Our recent studies have focused on the role of different classes of
theoretical descriptors of increasing levels of complexity and their
utility in QSAR (Gute & Basak 1997; Gute, Grunwald, & Basak In press).
Four distinct sets of theoretical descriptors have been used in this
study: topostructural, topochemical, geometric, and quantum chemical
indices. Gute and Basak 1997 provide the detailed list of the indices
included in our study.

2.1  TOPOLOGICAL  INDICES 

The topostructural and topochemical indices fall into the category
normally considered topological indices. Topostructural indices (TSIs) are
topological indices that only encode information about the adjacency and
distances of atoms (vertices) in molecular structures (graphs),
irrespective of the chemical nature of the atoms involved in bonding or
factors such as hybridization states and the number of core/valence
electrons in individual atoms. Topochemical indices (TCIs) are parameters
that quantify information regarding the topology (connectivity of atoms),
as well as specific chemical properties of the atoms comprising a
molecule. These indices are derived from weighted molecular graphs where
each vertex (atom) or edge (bond) is properly weighted with selected
chemical or physical property information.

The complete set of topological indices used in this study, both the
topostructural and the topochemical, has been calculated using POLLY 2.3
(Basak, Harriss, & Magnuson 1988) and software developed by the authors.
These indices include the Wiener index (Wiener 1947), the connectivity
indices developed by Randic 1975 and higher order connectivity indices
formulated by Kier and Hall 1986, bonding connectivity indices defined by
Basak and Magnuson 1988, a set of information theoretic indices defined on
the distance matrices of simple molecular graphs (Hansch & Leo 1995),
neighborhood complexity indices of hydrogen-filled molecular graphs, and
Balaban's 1983 J indices.
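As an illustration of how a topostructural index follows from the molecular graph alone, the sketch below computes the Wiener index and the two Zagreb parameters for hydrogen-suppressed graphs of benzene and toluene. It is not the POLLY implementation; the adjacency-list encoding and the function names are assumptions made for this example, and the graphs ignore atom and bond types, as topostructural indices do.

```python
from collections import deque

def wiener_index(adjacency):
    """Half-sum of the off-diagonal entries of the graph distance matrix."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted twice

def zagreb_m1(adjacency):
    """Sum of squared vertex degrees."""
    return sum(len(nbrs) ** 2 for nbrs in adjacency.values())

def zagreb_m2(adjacency):
    """Sum of degree products over all edges (each edge counted once)."""
    return sum(
        len(adjacency[u]) * len(adjacency[v])
        for u in adjacency for v in adjacency[u] if u < v
    )

# Hydrogen-suppressed benzene: a six-membered ring of carbon vertices.
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

# Toluene: the same ring with one extra vertex (index 6) bonded to vertex 0.
toluene = {i: list(nbrs) for i, nbrs in benzene.items()}
toluene[0].append(6)
toluene[6] = [0]

print(wiener_index(benzene), wiener_index(toluene))
```

Because these indices see only connectivity, benzene and any other unsubstituted six-membered ring receive identical values; distinguishing them is the role of the topochemical indices.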

2.2  GEOMETRICAL  INDICES 

The geometrical indices are three-dimensional Wiener numbers for
hydrogen-filled molecular structure, hydrogen-suppressed molecular
structure, and van der Waals volume. Van der Waals volume, $V_W$ (Bondi
1964), was calculated using Sybyl 6.1 from Tripos Associates, Inc. of St.
Louis. The 3-D Wiener numbers were calculated by Sybyl using an SPL (Sybyl
Programming Language) program developed in our lab (SYBYL 1998).
Calculation of 3-D Wiener numbers consists of summing the entries in the
upper triangular submatrix of the topographic Euclidean distance matrix
for a molecule. The 3-D coordinates for the atoms were determined using
CONCORD 3.0.1 from Tripos Associates, Inc. Two variants of the 3-D Wiener
number were calculated: $^{3D}W_H$ and $^{3D}W$. For $^{3D}W_H$, hydrogen
atoms are included in the computations and for $^{3D}W$, hydrogen atoms
are excluded from the computations.
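The 3-D Wiener number described above can be sketched in a few lines: sum the Euclidean distances over all unordered atom pairs, which is the upper triangle of the distance matrix. This is not the SPL program used in the study. The atom tuple layout, the function name, and the coordinates below are hypothetical and only exercise the arithmetic; real coordinates would come from a geometry generator such as CONCORD.

```python
import math
from itertools import combinations

def wiener_3d(atoms, include_hydrogens=True):
    """Sum of Euclidean distances over all unordered atom pairs.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples. With
    include_hydrogens=False, hydrogen atoms are dropped first, giving
    the hydrogen-suppressed variant.
    """
    coords = [
        xyz for element, xyz in atoms
        if include_hydrogens or element != "H"
    ]
    return sum(math.dist(a, b) for a, b in combinations(coords, 2))

# Hypothetical three-atom fragment (made-up coordinates, arbitrary units).
fragment = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("H", (0.0, 1.0, 0.0)),
]

w_filled = wiener_3d(fragment, include_hydrogens=True)
w_suppressed = wiener_3d(fragment, include_hydrogens=False)
print(w_filled, w_suppressed)
```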

2.3 QUANTUM CHEMICAL
PARAMETERS

The following quantum chemical parameters were calculated using the Austin
Model version one (AM1) semi-empirical Hamiltonian: energy of the highest
occupied molecular orbital ($E_{HOMO}$), energy of the second highest
occupied molecular orbital ($E_{HOMO1}$), energy of the lowest unoccupied
molecular orbital ($E_{LUMO}$), energy of the second lowest unoccupied
molecular orbital ($E_{LUMO1}$), heat of formation ($\Delta H_f$), and
dipole moment ($\mu$). These parameters were calculated using MOPAC 6.00
in the SYBYL interface (Stewart 1990).

3  FILTERING  DESCRIPTORS 

As stated above, one potential problem with including all theoretical
descriptors in the hierarchy is that it gives many independent variables
when compared to the limited number of data points available for hazard
assessment modeling of a particular chemical derivative. Compounding this
problem is that a salient descriptor for one hazard assessment model may
not be a salient descriptor for another problem. That is, the relevance of
a descriptor for predicting hazard assessment is often problem dependent.
This section describes our approach for automatically filtering the
descriptors with a GA-based approach to ensemble feature detection. Before
explaining our algorithm, we briefly cover the notion of ensembles.

3.1  ENSEMBLES 

Figure 1: A predictor ensemble.

Figure 1 illustrates the basic framework of a predictor ensemble. Each
predictor in the ensemble (predictor 1 through predictor N in this case)
is first trained using the training instances. Then, for each example, the
predicted output of each of these predictors ($o_i$ in Figure 1) is
combined to produce the output of the ensemble ($\hat{o}$ in Figure 1).
Many researchers (Breiman 1996; Hansen & Salamon 1990; Krogh & Vedelsby
1995; Opitz & Shavlik 1997) have demonstrated the effectiveness of
combining schemes that are simply the weighted average of the predictors
(i.e., $\hat{o} = \sum_{i \in N} w_i \cdot o_i$ and
$\sum_{i \in N} w_i = 1$); this is the type of ensemble on which we focus
in this article.
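A weighted-average combiner of this kind is small enough to write out. The sketch below is illustrative only: the function name is an assumption, and the default of equal weights is a choice made here for convenience.

```python
def ensemble_output(outputs, weights=None):
    """Weighted average of the individual predictor outputs.

    `outputs` holds one prediction per ensemble member for a single
    example; `weights` must sum to 1 and defaults to equal weights.
    """
    n = len(outputs)
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("need one weight per predictor, summing to 1")
    return sum(w * o for w, o in zip(weights, outputs))

# Three hypothetical predictions for one compound.
print(ensemble_output([3.0, 4.0, 5.0]))
print(ensemble_output([3.0, 4.0, 5.0], weights=[0.5, 0.25, 0.25]))
```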

Combining the output of several predictors is useful only if there is
disagreement on some inputs. Obviously, combining several identical
predictors produces no gain. Hansen and Salamon 1990 proved that for an
ensemble, if the average error rate for an example is less than 50% and
the predictors in the ensemble are independent in the production of their
errors, the expected error for that example can be reduced to zero as the
number of predictors combined goes to infinity; however, such assumptions
rarely hold in practice.

Krogh and Vedelsby 1995 later proved that the ensemble error can be
divided into a term measuring the average generalization error of each
individual predictor and a term called diversity that measures the
disagreement among the predictors. Formally, they define the diversity
term, $d_i(x)$, of predictor $i$ on input $x$ to be:

$$d_i(x) = [o_i(x) - \hat{o}(x)]^2. \qquad (1)$$

The quadratic error of predictor $i$ and of the ensemble are,
respectively:

$$\epsilon_i(x) = [o_i(x) - f(x)]^2, \qquad (2)$$

$$e(x) = [\hat{o}(x) - f(x)]^2, \qquad (3)$$

where $f(x)$ is the target value for input $x$. If we define $E$, $E_i$,
and $D_i$ to be the averages, over the input distribution, of $e(x)$,
$\epsilon_i(x)$, and $d_i(x)$ respectively, then the ensemble's
generalization error can be shown to consist of two distinct portions:

$$E = \bar{E} - \bar{D}, \qquad (4)$$

where $\bar{E}$ ($= \sum_i w_i E_i$) is the weighted average of the
individual predictors' generalization error and $\bar{D}$
($= \sum_i w_i D_i$) is the weighted average of the diversity among these
predictors. What the equation shows, then, is that an ideal ensemble
consists of highly correct predictors that disagree as much as possible.
Opitz and Shavlik 1996a; 1996b empirically verified that such ensembles
generalize well.
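The decomposition can be checked numerically. The sketch below evaluates the three quantities for a single input, where the identity already holds exactly (averaging over inputs preserves it because every term is linear in the average). The function name and the example numbers are assumptions chosen for illustration.

```python
def error_decomposition(outputs, weights, target):
    """Return (ensemble error, weighted mean error, weighted mean diversity)
    for one input, using squared differences throughout."""
    o_hat = sum(w * o for w, o in zip(weights, outputs))
    ensemble_error = (o_hat - target) ** 2
    mean_error = sum(w * (o - target) ** 2 for w, o in zip(weights, outputs))
    mean_diversity = sum(w * (o - o_hat) ** 2 for w, o in zip(weights, outputs))
    return ensemble_error, mean_error, mean_diversity

# Three hypothetical predictors with equal weights and a target of 4.5.
outputs = [3.0, 4.0, 5.0]
weights = [1 / 3, 1 / 3, 1 / 3]
e, e_bar, d_bar = error_decomposition(outputs, weights, target=4.5)

# The ensemble error equals mean error minus mean diversity.
print(e, e_bar - d_bar)
```

Since the diversity term is never negative, the ensemble error can never exceed the weighted average error of its members; the more the members disagree at a fixed level of individual accuracy, the larger the gain.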

Regardless of theoretical justifications, methods for creating ensembles
center around producing predictors that disagree on their predictions.
Generally, these methods focus on altering the training process in the
hope that the resulting predictors will produce different predictions. For
example, neural network techniques that have been employed include methods
for training with different topologies, different initial weights,
different parameters, and training only on a portion of the training set
(Alpaydin 1993; Freund & Schapire 1996; Hansen & Salamon 1990; Maclin &
Shavlik 1995).

Numerous techniques try to generate disagreement among the classifiers by
altering the training set each classifier sees. The two most popular
techniques are Bagging (Breiman 1996) and Boosting (Freund & Schapire
1996). Bagging is a bootstrap ensemble method that trains each network in
the ensemble with a different partition of the training set. It generates
each partition by randomly drawing, with replacement, N examples from the
training set, where N is the size of the training set. As with Bagging,
Boosting also chooses a training set of size N and initially sets the
probability of picking each example to be 1/N. After the first network,
however, these probabilities change to emphasize misclassified instances.
A large number of extensive empirical studies have shown that these are
highly successful methods that nearly always generalize better than their
individual component predictors (Bauer & Kohavi 1998; Maclin & Opitz 1997;
Quinlan 1996). Neither approach is appropriate for our domain since we are
data poor and cannot afford to waste training examples; however, we are
feature rich and can afford to create diversity by instead varying the
inputs to the learning algorithms. Varying the feature subsets to create a
diverse set of accurate predictors is the focus of the next section.
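To make concrete why resampling "wastes" examples on a small data set, the sketch below draws one Bagging-style bootstrap sample and computes the expected fraction of distinct examples it contains. The function names, the seed, and the choice of N = 68 (the size of one leave-one-out training set in this study) are assumptions made for the illustration.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n example indices uniformly, with replacement, from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def expected_unique_fraction(n):
    """Expected fraction of distinct examples in one bootstrap sample:
    each example is missed by all n draws with probability (1 - 1/n)**n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
sample = bootstrap_indices(68, rng)
left_out = set(range(68)) - set(sample)

print(len(set(sample)), len(left_out))
print(expected_unique_fraction(68))
```

On average, a little over a third of the examples are absent from any one bootstrap sample, which is a substantial loss when only a few dozen compounds are available.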

3.2  THE  GEFS  ALGORITHM 

The goal of our algorithm is to find a set of feature subsets that creates
an ensemble of classifiers (neural networks in this study) that maximize
equation 1 while minimizing equation 2. The space of candidate sets is
enormous and thus is particularly well suited for genetic algorithms.
Table 1 summarizes our recent algorithm (Opitz 1999) called Gefs (for
Genetic Ensemble Feature Selection) that uses GAs to generate a set of
classifiers that are accurate and diverse in their predictions. Gefs
starts by creating and training its initial population of networks. The
representation of each individual of our population is simply a dynamic
length string of integers, where each integer indexes a particular
feature. We create networks from these strings by first having the input
nodes match the string of integers, then creating a standard
single-hidden-layer, fully connected neural network. Our algorithm then
creates new networks by using the genetic operators of crossover and
mutation.

Table 1: The Gefs algorithm.

GOAL: Find a set of input subsets to create an accurate and diverse
classifier ensemble.

1. Using varying inputs, create and train the initial population of
   classifiers.

2. Until a stopping criterion is reached:

   (a) Use genetic operators to create new networks.

   (b) Measure the diversity of each network with respect to the current
       population.

   (c) Normalize the accuracy scores and the diversity scores of the
       individual networks.

   (d) Calculate fitness of each population member.

   (e) Prune the population to the N fittest networks.

   (f) Adjust $\lambda$.

   (g) The current population is the ensemble.

Gefs trains these new individuals using backpropagation. It adds new
networks to the population and then scores each population member with
respect to its prediction accuracy and diversity. Gefs normalizes these
scores, then defines the fitness of each population member ($i$) to be:

$$Fitness_i = Accuracy_i + \lambda \, Diversity_i \qquad (5)$$

where $\lambda$ defines the tradeoff between accuracy and diversity.
Finally, Gefs prunes the population to the N most-fit members, then
repeats this process. At every point in time, the current ensemble
consists of simply averaging (with equal weight) the predictions of the
output of each member of the current population. Thus as the population
evolves, so does the ensemble.

We define accuracy to be network $i$'s training-set accuracy. (One may use
a validation set if there are enough training instances.) We define
diversity to be the average difference between the prediction of our
component classifier and the ensemble. We then separately normalize both
terms so that the values range from 0 to 1. Normalizing both terms allows
$\lambda$ to have the same meaning across domains.
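The scoring step can be sketched as follows. The text states only that both terms are normalized to the range 0 to 1 and that diversity is an average difference from the ensemble; the min-max normalization, the use of absolute differences, and all function names below are assumptions made for this illustration rather than the published implementation.

```python
def diversity(member_predictions, ensemble_predictions):
    """Average absolute difference between one member and the ensemble
    (absolute value is an assumption; the text says 'average difference')."""
    pairs = list(zip(member_predictions, ensemble_predictions))
    return sum(abs(m - e) for m, e in pairs) / len(pairs)

def min_max(values):
    """Rescale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # degenerate case: all scores equal
    return [(v - lo) / (hi - lo) for v in values]

def fitness_scores(accuracies, diversities, lam):
    """Fitness_i = Accuracy_i + lambda * Diversity_i on normalized scores."""
    acc = min_max(accuracies)
    div = min_max(diversities)
    return [a + lam * d for a, d in zip(acc, div)]

def prune(population, scores, n):
    """Keep the n members with the highest fitness."""
    ranked = sorted(zip(scores, range(len(population))), reverse=True)
    return [population[i] for _, i in ranked[:n]]

# Three hypothetical population members.
scores = fitness_scores([0.6, 0.8, 1.0], [0.3, 0.1, 0.2], lam=1.0)
survivors = prune(["net_a", "net_b", "net_c"], scores, n=2)
print(scores, survivors)
```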

It is not always clear at what value one should set $\lambda$; therefore,
we automatically adjust $\lambda$ based on the discrete derivatives of the
ensemble error $E$, the average population error $\bar{E}$, and the
average diversity $\bar{D}$ within the ensemble. First, we never change
$\lambda$ if $E$ is decreasing; otherwise we (a) increase $\lambda$ if
$\bar{E}$ is not increasing and the population diversity $\bar{D}$ is
decreasing; or (b) decrease $\lambda$ if $\bar{E}$ is increasing and
$\bar{D}$ is not decreasing. We started $\lambda$ at 1.0 for the
experiments in this article. The amount $\lambda$ changes is 10% of its
current value.
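The adjustment rule can be written as a small function of three discrete derivatives (the change in each quantity since the previous generation). This sketch reads conditions (a) and (b) as tests on the average population error, which is an interpretation of the text; the function and parameter names are assumptions.

```python
def adjust_lambda(lam, d_ensemble_error, d_population_error, d_diversity,
                  step=0.10):
    """Return the updated accuracy/diversity tradeoff.

    Each d_* argument is the change in that quantity over the last
    generation (negative means decreasing).
    """
    if d_ensemble_error < 0:
        return lam  # ensemble is still improving: leave lambda alone
    if d_population_error <= 0 and d_diversity < 0:
        return lam * (1.0 + step)  # (a) diversity is eroding: reward it more
    if d_population_error > 0 and d_diversity >= 0:
        return lam * (1.0 - step)  # (b) accuracy is eroding: reward it more
    return lam  # neither condition applies

lam = 1.0  # starting value used in the article
lam = adjust_lambda(lam, d_ensemble_error=0.02,
                    d_population_error=0.0, d_diversity=-0.05)
print(lam)
```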

We create the initial population by randomly choosing the number of
features to include in each feature subset. For classifier $i$, the size
of each feature subset ($N_i$) is independently chosen from a uniform
distribution between 1 and twice the number of original features in the
dataset. We then randomly pick, with replacement, $N_i$ features to
include in classifier $i$'s training set. Note that some features may be
picked multiple times while others may not be picked at all; replicating
inputs for a neural network may give the network a better chance to
utilize that feature during training. Also, replicating a feature in a
genome encoding allows that feature to better survive to future
generations.
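A minimal sketch of this initialization, assuming features are identified by integer indices; the function names, the population size of 20, and the seed are hypothetical choices for the example.

```python
import random

def random_feature_subset(n_features, rng):
    """One individual: a variable-length list of feature indices.

    The length is uniform on 1..2*n_features and the indices are drawn
    with replacement, so duplicates are allowed.
    """
    size = rng.randint(1, 2 * n_features)  # randint includes both endpoints
    return [rng.randrange(n_features) for _ in range(size)]

def initial_population(pop_size, n_features, rng):
    return [random_feature_subset(n_features, rng) for _ in range(pop_size)]

rng = random.Random(0)  # arbitrary seed for repeatability
population = initial_population(pop_size=20, n_features=95, rng=rng)
print([len(individual) for individual in population])
```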

Our crossover operator uses dynamic-length, uniform crossover. In this
case, we choose the feature subsets of two individuals in the current
population proportional to fitness. Each feature in both parents' subsets
is independently considered and randomly placed in the feature set of one
of the two children. Thus it is possible to have a feature set that is
larger (or smaller) than the largest (or smallest) of either parent's
feature subset. Our mutation operator works much like traditional genetic
algorithms; we randomly replace a small percentage of a parent's feature
subset with new features. With both operators, the network is trained from
scratch using the new feature subset; thus no internal structure of the
parents is saved during the crossover.
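The two operators can be sketched on the integer-list representation as follows. The function names, the 50/50 placement probability, and the mutation-rate parameter are assumptions for illustration; the text specifies only that placement is random and that a small percentage of features is replaced.

```python
import random

def select_parents(population, fitness, rng):
    """Pick two parents with probability proportional to fitness
    (the same individual may be drawn twice)."""
    return rng.choices(population, weights=fitness, k=2)

def crossover(parent_a, parent_b, rng):
    """Dynamic-length uniform crossover: every feature from either parent
    is placed in exactly one of the two children, chosen at random."""
    child_1, child_2 = [], []
    for feature in parent_a + parent_b:
        (child_1 if rng.random() < 0.5 else child_2).append(feature)
    return child_1, child_2  # either child may come out empty

def mutate(subset, n_features, rate, rng):
    """Replace each feature, with probability `rate`, by a random one."""
    return [
        rng.randrange(n_features) if rng.random() < rate else feature
        for feature in subset
    ]

rng = random.Random(1)  # arbitrary seed
parent_a, parent_b = [0, 3, 3, 7], [2, 5]
child_1, child_2 = crossover(parent_a, parent_b, rng)
print(child_1, child_2, mutate(child_1, 95, 0.05, rng))
```

Because the children only partition the parents' combined features, an implementation would need some policy for an empty child, for example redrawing; that detail is not covered by the text.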

4  RESULTS 

We tested the utility of combining our approach for generating numerous
hierarchical theoretical descriptors of compounds with our approach for
filtering these descriptors with Gefs by modeling the acute aquatic
toxicity (LC50) of a congeneric set of 69 benzene derivatives. The data
was taken from the work of Hall, Kier and Phipps 1984 where acute aquatic
toxicity was measured in fathead minnow (Pimephales promelas). Their data
was compiled from eight other sources, as well as some original work which
was conducted at the U.S. Environmental Protection Agency (USEPA)
Environmental Research Laboratory in Duluth, Minnesota. This set of
chemicals was composed of benzene and 68 substituted benzene derivatives.

Table 2 gives our results. We studied three approaches
for modeling toxicity: (1) giving all theoretical descriptors
to a neural network, (2) reducing the feature set
in a traditional, previously published (Gute & Basak
1997) manner, and (3) using our new genetic algorithm
technique on the entire feature set to create a neural
network ensemble. Results for our approaches are
from leave-one-out experiments (i.e., 69 training/test
set partitions). Leave-one-out works by leaving one
data point out of the training set and giving the
remaining instances (68 in this case) to the learning
algorithms for training. (It is worth noting that each
member of the ensemble sees the same 68 training instances
for each training/test set partition and thus ensembles
have no unfair advantage over other learners.) This
process is repeated 69 times so that each example is
a part of the test set once and only once. Leave-one-out
tests generalization accuracy of a learner, whereas
training set accuracy tests only the learner's ability to
memorize. Generalization error from the test set is the
true test of accuracy and is what we report here.
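
The leave-one-out protocol can be sketched as below. This is an illustration under loud assumptions: a least-squares learner (`fit_linear`) stands in for the neural networks, the data are synthetic, and the squared Pearson correlation and root-mean-square error are used as stand-ins for the reported explained variance and standard error, whose exact formulas the text does not give.

```python
import numpy as np


def fit_linear(X, y):
    # Placeholder learner: ordinary least squares with an intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def leave_one_out(X, y, fit):
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i      # hold out example i only
        model = fit(X[train], y[train])
        preds[i] = model(X[i:i + 1])[0]     # predict the held-out example
    return preds


rng = np.random.default_rng(0)
X = rng.normal(size=(69, 3))                # synthetic stand-in data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=69)
preds = leave_one_out(X, y, fit_linear)
r2 = np.corrcoef(y, preds)[0, 1] ** 2       # squared correlation on held-out predictions
rmse = np.sqrt(np.mean((y - preds) ** 2))   # error on held-out predictions
```

Every prediction in `preds` comes from a model that never saw the corresponding example, which is what separates this estimate from training-set accuracy.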

We first trained neural networks using all 95 parameters.
The networks contained 15 hidden units and
we trained the networks for 1000 epochs. We normalized
each input parameter to a value between 0 and 1
before training. Additional parameter settings for the
neural networks included a learning rate of 0.05, a
momentum term of 0.1, and weights initialized randomly
between -0.25 and 0.25. With all 95 input parameters,
the neural networks obtained a test-set correlation
coefficient between predicted toxicity and measured
toxicity (explained variance) of R^2 = 0.868 and a standard
error of 0.29. Target toxicity measurements ranged
from 3.04 to 6.37.
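
The input normalization mentioned above is commonly done by min-max scaling, sketched here. This is an assumption about the procedure, since the text only states the target range of 0 to 1; the function name and the handling of constant columns are choices made for this example.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column of X linearly so its values span [0, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    # Constant columns would divide by zero; map them to 0 instead.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span


X = np.array([[1.0, 10.0, 5.0],
              [3.0, 20.0, 5.0],
              [2.0, 40.0, 5.0]])
scaled = min_max_scale(X)
```

Whether the minimum and maximum were taken over the whole data set or only over each training fold is not stated in the text; computing them per fold avoids leaking information from the held-out example.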

Our first method for feature-set reduction follows the
work of Gute and Basak 1997 on toxicity domains.
Their method begins by using the VARCLUS method
of SAS 1998 to select subsets of topostructural and
topochemical parameters for QSAR model development.
With this method, the set of topological indices
is first partitioned into two distinct sets, the
topostructural indices and the topochemical indices.


Table 2: Relative effectiveness of statistical and neural
network methods in estimating LC50 of 69 benzene
derivatives.

Method               R^2      Standard Error
NN with 95 inputs    0.868    0.29
VARCLUS              0.825    0.32
NN with Gefs         0.893    0.27

To further reduce the number of independent variables
for model construction, the sets of topostructural and
topochemical indices were further divided into subsets,
or clusters, based on the correlation matrix using the
VARCLUS procedure. This procedure divides the set
of indices into disjoint clusters, such that each cluster
is essentially unidimensional. From each cluster
we selected the index most correlated with the cluster,
as well as any indices which were poorly correlated
with their cluster (R^2 < 0.70). The variable
clustering and selection of indices was performed
independently for both the topostructural and topochemical
indices. This procedure resulted in a set of five
topostructural indices and a set of nine topochemical
indices. These indices were combined with the three
geometric and six quantum chemical parameters described
earlier. Their approach then applied linear regression
to these 23 parameters. This study found that
an accurate linear regression model for acute aquatic
toxicity required descriptors from all four levels of the
hierarchy: topostructural, topochemical, geometrical
and quantum chemical. This model utilized seven descriptors
and obtained an explained variance (R^2) of
0.863 and a standard error of 0.30 on the whole data
set used as a training set. Our leave-one-out experiment
gave an R^2 = 0.825 and a standard error of 0.32.

Finally we applied our genetic algorithm technique,
Gefs, using all 95 parameters. The parameter settings
for the networks in the ensemble were the same as
the settings for the single networks in the first experiment.
Parameter settings for the genetic algorithm
portion of Gefs include a mutation rate of 50%, a
population size of 20, A = 1.0, and a search length
of 100 networks (20 networks for the initial population
and 80 networks created from crossover and mutation).
While the mutation rate may seem high as compared
with traditional genetic algorithms, certain aspects of
our approach call for a higher mutation rate (such as
the criterion of generating a population that cooperates
as well as our emphasis on diversity); other mutation
values were tried during our pilot studies. With
this approach, we obtained a test-set correlation coefficient
of R^2 = 0.893 and a standard error of 0.27; the
initial population of 20 networks obtained a test-set
R^2 = 0.835 and a standard error of 0.31.

5  DISCUSSION  AND  FUTURE  WORK

The correlation coefficient between the predicted value
from the computational model and the target value
derived from the toxicity test is an extremely informative
metric of accuracy in this case. The exact numeric
value of most toxicity tests is not as important as the
relative ordering and spread of these values. Thus,
a perfect correlation (R^2 = 1.0) between the computational
model and target toxicity shows the computational
model is as informative as the toxicity obtained
from a battery of expensive and time-consuming tests
- regardless of the standard error. Note the standard
error of 0.27 is fairly good, given the toxicity measurements
ranged from 3.04 to 6.37.

While the neural network technique and the standard
data-reduction technique obtained decent correlation
with measured toxicity, our ensemble technique was
about 20% closer to perfect correlation. Note that
Gefs produces an accurate initial population and that
running Gefs longer with our genetic operators can
further increase performance. Thus our approach can
be viewed as an “anytime” learning algorithm. Such
a learning algorithm should produce a good concept
quickly, then continue to search concept space, reporting
the new “best” concept whenever one is found
(Opitz & Shavlik 1997). This is important since, for
most hazard assessment, an expert is willing to wait
for days, or even weeks, if a learning system can produce
an improved model for predicting toxicity.
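
One plausible reading of the "about 20% closer to perfect correlation" figure, offered here as an interpretation rather than the authors' stated calculation, is the fraction of the remaining gap to R^2 = 1 that the ensemble closes relative to the single network trained on all 95 inputs (values taken from Table 2). The same calculation against the VARCLUS-based model is shown for comparison.

```latex
\[
\frac{(1 - 0.868) - (1 - 0.893)}{1 - 0.868}
  = \frac{0.025}{0.132} \approx 0.19
  \qquad \text{(ensemble vs. NN with 95 inputs)}
\]
\[
\frac{(1 - 0.825) - (1 - 0.893)}{1 - 0.825}
  = \frac{0.068}{0.175} \approx 0.39
  \qquad \text{(ensemble vs. VARCLUS)}
\]
```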

Our results demonstrate a very important point: that
our method is able to accurately predict toxicity directly
from structure. Compared to the actual battery
of tests necessary to measure toxicity, a computer
model is much cheaper, much faster, and does not have
a negative impact on the environment. It is important
to also note that the computer model does not have to
be the final measurement for hazard assessment; additional
tests can be run on compounds that are either
flagged by the model, or require more tests by the nature
of their use (such as a benzene derivative that may
become a standard fuel). Not only can good computer
models become filters, they will probably be the only
viable option for processing all registered chemicals.

While the method proposed here has proven effective,
there is much future work that needs to be completed.
For instance, we plan to test our method on other data
sets of chemical derivatives; investigate other ensemble
feature selection techniques; investigate variants to our
genetic algorithm approach; and finally investigate the
utility of other descriptors, such as bio-descriptors.

6  CONCLUSIONS 

In this paper we presented a novel approach for creating
a computer model for hazard assessment. Our
approach works by first extracting a hierarchy of theoretical
descriptors derived from the structure of a compound,
then filtering the numerous possible descriptors
with a genetic algorithm approach to ensemble feature
selection. We tested the utility of our approach
by modeling the acute aquatic toxicity (LC50) of a
congeneric set of 69 benzene derivatives. Our results
demonstrate the ability of our approach to accurately
predict toxicity directly from structure. Thus our new
algorithm further increases the applicability of computer
models to the problem of predicting a chemical's
activity directly from its structure.
Abstract

A novel QSAR study of the complement-inhibitory activity of benzamidines and the acute toxicity of benzene derivatives is
reported, and a new efficient method for selecting descriptors is used. Complement-inhibitory activity QSAR models
of benzamidines contain from one to five descriptors. The best, according to fitted and cross-validated statistical
parameters, is shown to be the five-descriptor model. Models with a higher number of indices did not improve over
the five-descriptor model. The benzene derivatives structure-toxicity models involve up to seven linear descriptors.
Multiregression models, containing up to ten nonlinear descriptors, are also reported for the sake of comparison with
previously obtained additivity models. Comparison with benzamidine complement-inhibitory activity models and
with benzene derivatives toxicity models from the literature favors our novel approach. © 1999 Elsevier Science
Ireland Ltd. All rights reserved.


Keywords: QSAR study; Complement-inhibitory activity; Benzene; Five-descriptor model


1.  Introduction 

In our recent papers a hierarchical QSAR (quantitative
structure-activity relationship) approach was used
to model the complement-inhibitory activity of benzamidines
(Basak et al., 1999a) and the acute aquatic
toxicities of benzene derivatives (Gute and Basak, 1997;
Basak et al., 1999c). The hierarchical QSAR approach
uses topological (partitioned into topostructural and
topochemical), geometric and quantum-chemical descriptors
in a stepwise fashion to build increasingly
more complex structure-property-activity models
(Basak et al., 1997, 1999b). Now we report the use,
with the same aim, of a new efficient approach for
selecting the best QSAR models using multivariate
regression (MR) (Lucic and Trinajstic, 1999; Lucic et
al., 1999a) and a standard approach for variable selection
and model generation used in CODESSA (Katritzky
et al., 1999; Lucic et al., 1999b). Some time ago
Hansch and Yoshimoto (1974) carried out a QSAR study
on the complement-inhibitory potency of benzamidines
using their own approach. Ten years later, Hall et al. (1984)
carried out a QSAR study on the toxicities of benzene
derivatives using de novo analysis (Free and Wilson,
1964; Kubinyi and Kehrhahn, 1976), and derived an
additivity model for 66 compounds (they excluded three
compounds as outliers). We will analyze their models
and compare them to ours.
Observed and calculated (cross-validated, CV, and fitted, FIT) complement-inhibitory activities 1/log C of 105 benzamidines

X                                              1/log C
                                               Observed   Calculated (CV)   Calculated (FIT)
2-CH3                                          -0.444     -0.417            -0.419
3,4-(CH3)2                                     -0.425     -0.423            -0.424
H                                              -0.418     -0.424            -0.423
3-OH                                           -0.415     -0.439            -0.434
3-CF3                                          -0.410     -0.378            -0.382
3-NO2                                          -0.410     -0.392            -0.395
3-Br                                           -0.405     -0.399            -0.400
3-CH3                                          -0.398     -0.399            -0.399
3-OCH3                                         -0.397     -0.401            -0.401
3-CH2C6H5                                      -0.373     -0.343            -0.346
3,5-(CH3)2                                     -0.361     -0.375            -0.369
3-OC3H7                                        -0.355     -0.358            -0.358
3-i-C5H11                                      -0.355     -0.344            -0.345
3-OC.H,                                        -0.351     -0.340            -0.341
3-C4H,                                         -0.338     -0.355            -0.353
3-CH=CHC6H5                                    -0.339     -0.324            -0.325
3-OCH2C6H5                                     -0.331     -0.324            -0.324
3-(CH2)2C6H5                                   -0.330     -0.332            -0.331
3-OC6H13                                       -0.329     -0.318            -0.319
3-O(CH2),OC6H5                                 -0.325     -0.286            -0.287
3-O(CH2)2OC6H5                                 -0.323     -0.314            -0.315
3-QH3                                          -0.323     -0.366            -0.359
3-O(CH2),OC6H4-4-COOH                          -0.321     -0.296            -0.297
3-OC,H„                                        -0.320     -0.327            -0.326
j-O-i-CjH,,                                    -0.318     -0.338            -0.335
3-O(CH2)2OC10H7-α                              -0.312     -0.255            -0.262
3-O(CH2)4OC6H4-4-NH2                           -0.306     -0.288            -0.289
3-(CH2)4C6H5                                   -0.302     -0.315            -0.313
3-O(CH2)3OC6H4-4-NO2                           -0.301     -0.282            -0.282
3-O(CH2),OC6H4-4-NH2                           -0.300     -0.298            -0.298
3-(CH2)2-4-C5H4N                               -0.299     -0.318            -0.318
3-O(CH2)3OC6H5                                 -0.299     -0.295            -0.295
3-O(CH2)3C6H5                                  -0.296     -0.290            -0.290
3-(CH2),-3-C5H4N                               -0.294     -0.298            -0.298
3-(CH2)4C6H4^-NHAc                             -0.294     -0.281            -0.282
3-(CH2),-2-C5H4N                               -0.291     -0.300            -0.299
3-O(CH2),OC6H4-2-NH2                           -0.283     -0.288            -0.288
3-O(CH2),OC6H4-4-NHAc                          -0.278     -0.270            -0.270
3-(CH2)4-3-C5H4N                               -0.276     -0.284            -0.284
3-O(CH2)4C6H5                                  -0.276     -0.277            -0.277
3-O(CH2)3OC6H4-3-NHAc                          -0.270     -0.260            -0.260
3-O(CH2),OC6H3-3,4-Cl2                         -0.265     -0.271            -0.271
3-O(CHO,OC6H4-3-NH2                            -0.265     -0.283            -0.283
3-O(CH2),OC6H4-2-NHCOC6H4-4-SO2F               -0.265     -0.247            -0.247
3-O(CH2)3OC6H4-2-NHCOC6H5                      -0.265     -0.258            -0.258
3-O(CH2)3OC6H4^-OCH3                           -0.262     -0.275            -0.274
3-O(CH2)4OC6H4-4-NHCONHC6H4-4-SO2F             -0.260     -0.236            -0.237
3-O(CH2)3OC6H4-2-NHCOC6H3-2-OCH3-5-SO2F        -0.260     -0.226            -0.227
3-O(CH2)3OC6H4^-Cl                             -0.257     -0.287            -0.286
3-O(CH2)3OC6H4-2-NO2                           -0.257     -0.279            -0.279
3-O(CH2)3OC6H4-3-NO2                           -0.257     -0.268            -0.268
3-O(CH2)3OC6H4-3-OCH3                          -0.256     -0.255            -0.255
3-O(CH2)3OC6H4-2-NHCOC6H3-2-Cl-6-SO2F          -0.255     -0.247            -0.248
3-O(CH2)3OC6H4-2-NHCONHC6H5                    -0.255     -0.260            -0.259
3-O(CH2)2OC6H4-2-NHCONHC6H3-2-Cl-5-SO2F        -0.250     -0.246            -0.246


Data  Reduction  and  Division  of  the  Topological  Indices 

Initially,  all  TIs  were  transformed  by  the  natural  logarithm  of  the  index  plus  one.  This 
was  done  since  the  scale  of  some  indices  may  be  several  orders  of  magnitude  greater 
than  that  of  other  indices  and  other  indices  may  equal  zero.  The  geometric  indices  were 
transformed  by  the  natural  logarithm  of  the  index  for  consistency;  the  addition  of  one 
was  unnecessary. 
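
The two transformations described above amount to the following. This is a small sketch; the function names are hypothetical and the sample values are made up for illustration.

```python
import numpy as np


def transform_topological(index_values):
    # ln(x + 1): defined when an index equals zero and compresses
    # indices whose scales differ by orders of magnitude.
    return np.log1p(index_values)


def transform_geometric(index_values):
    # Plain natural logarithm; assumes strictly positive values.
    return np.log(index_values)


ti = np.array([0.0, 9.0, 12345.0])        # made-up topological index values
ti_transformed = transform_topological(ti)
geo_transformed = transform_geometric(np.array([1.0, np.e]))
```

`np.log1p` is used instead of `np.log(x + 1)` because it is the numerically safer form of the same function for small values.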

The  set  of  TIs  was  partitioned  into  two  distinct  sets:  topostructural  indices  and 
topochemical  indices.  Topostructural  indices  are  indices  which  encode  information 
about  the  adjacency  and  distances  of  atoms  (vertices)  in  molecular  structures  (graphs) 
irrespective  of  the  chemical  nature  of  the  atoms  involved  in  the  bonding  or  factors  like 
hybridization  states  of  atoms  and  number  of  core/valence  electrons  in  individual  atoms. 
Topochemical  indices  are  parameters  which  quantify  information  regarding  the  topology 
(connectivity  of  atoms)  as  well  as  specific  chemical  properties  of  the  atoms  comprising  a 
molecule.  Topochemical  indices  are  derived  from  weighted  molecular  graphs  where 
each  vertex  (atom)  is  properly  weighted  with  selected  chemical/physical  properties. 
These  sets  of  the  indices  are  shown  in  Table  I. 

To  reduce  the  number  of  independent  variables  that  were  used  for  model 
construction  in  the  smaller  sets  of  compounds,  the  sets  of  topostructural  and 
topochemical  indices  were  further  divided  into  subsets,  or  clusters,  based  on  the 
correlation  matrix  using  the  SAS  procedure  VARCLUS  [38].  The  VARCLUS  procedure 
divides  the  set  of  indices  into  disjoint  clusters  so  that  each  cluster  is  essentially 
unidimensional.  From  each  cluster  we  select  the  index  most  correlated  with  the  cluster, 
as  well  as  any  indices  which  are  poorly  correlated  with  the  cluster  (r  <  0.70).  These 
indices  are  then  used  in  model  construction.  The  variable  clustering  and  selection  of 
indices  is  performed  independently  for  both  the  topostructural  and  topochemical 
subsets. 
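
The selection rule applied within each cluster can be illustrated as follows. This is not the SAS VARCLUS procedure: the clustering itself is assumed to have been done already, the cluster component is approximated here by the first principal component of the standardized indices in the cluster, and the 0.70 threshold is applied to the squared correlation, which is an assumption about how the cutoff was defined. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def select_from_cluster(X_cluster, threshold=0.70):
    """Pick the index most correlated with its cluster component, plus any
    index whose squared correlation with that component is below threshold."""
    # Standardize columns (assumes no constant columns).
    Z = (X_cluster - X_cluster.mean(axis=0)) / X_cluster.std(axis=0)
    # First principal component of the standardized block.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    component = Z @ vt[0]
    r2 = np.array([np.corrcoef(Z[:, j], component)[0, 1] ** 2
                   for j in range(Z.shape[1])])
    keep = {int(np.argmax(r2))} | {j for j in range(len(r2)) if r2[j] < threshold}
    return sorted(keep), r2


rng = np.random.default_rng(0)
a = rng.normal(size=200)
# Two nearly collinear indices and one unrelated index in the same "cluster".
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])
keep, r2 = select_from_cluster(X)
```

In this toy case one of the two collinear columns represents the cluster, and the unrelated third column is retained as well because it is poorly correlated with the cluster component.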


III.  DEVELOPMENT  OF  HIERARCHICAL  QSAR  MODELS

In the development of hierarchical QSAR models, between two and four sets of indices
have been used. A schematic of this method is given in Figure 1, and the SAS procedure
REG is used to conduct the all-subsets regression analyses [38]. Final model selection
from the all-subsets regression is based on the results for both RSQUARE and CP
(Mallows' Cp statistic). The hierarchy begins with the simplest indices, the topostructural.
After developing our initial model utilizing the topostructural indices, the level of
complexity is increased one step. To the indices included in the best topostructural
model, all of the topochemical indices are added and modeling is conducted using the
combined set of parameters.

[Insert Figure 1 here]

Likewise, the indices included in the best model from this procedure are combined with
the geometrical indices and modeling is conducted once again. Finally, in some studies
we have included quantum chemical parameters calculated by MOPAC. The parameters are
added to the best model selected from modeling with the combination of topostructural,
topochemical and geometrical parameters, and all-subsets regression is used to find the
best-fit model. In some of our studies we have also used each level of the hierarchy
individually to compare the results of using only one higher-level set, e.g., geometrical
indices, alone to
Table 2. 69 benzene derivatives and their observed and calculated (cross-validated, CV, and fitted, FIT) fathead minnow toxicities, expressed
as -log(LC50)

No.  Compound                        -log(LC50)
                                     Observed   Calculated (CV)*   Calculated (FIT)*
1    Benzene                         3.40       3.29               3.32
2    Bromobenzene                    3.89       4.04               4.01
3    Chlorobenzene                   3.77       3.75               3.75
4    Phenol                          3.51       3.31               3.35
5    Toluene                         3.32       3.51               3.49
6    1,2-Dichlorobenzene             4.40       4.33               4.33
7    1,3-Dichlorobenzene             4.30       4.10               4.12
8    1,4-Dichlorobenzene             4.62       4.80               4.77
9    2-Chlorophenol                  4.02       4.01               4.01
10   3-Chlorotoluene                 3.84       3.72               3.73
11   4-Chlorotoluene                 4.33       4.11               4.13
12   1,3-Dihydroxybenzene            3.04       3.31               3.28
13   3-Hydroxyanisole                3.21       3.13               3.14
14   2-Methylphenol                  3.77       3.62               3.62
15   3-Methylphenol                  3.29       3.52               3.51
16   4-Methylphenol                  3.58       3.64               3.64
17   4-Nitrophenol                   3.36       3.68               3.66
18   1,4-Dimethoxybenzene            3.07       3.01               3.01
19   1,2-Dimethylbenzene             3.48       3.84               3.81
20   1,4-Dimethylbenzene             4.21       3.94               3.97
21   2-Nitrotoluene                  3.57       3.70               3.69
22   3-Nitrotoluene                  3.63       3.67               3.66
23   4-Nitrotoluene                  3.76       3.71               3.71
24   1,2-Dinitrobenzene              5.45       4.95               5.09
25   1,3-Dinitrobenzene              4.38       4.12               4.15
26   1,4-Dinitrobenzene              5.22       4.83               4.91
27   2-Methyl-3-nitroaniline         3.48       3.74               3.73
28   2-Methyl-4-nitroaniline         3.24       3.50               3.47
29   2-Methyl-5-nitroaniline         3.35       3.80               3.77
30   2-Methyl-6-nitroaniline         3.80       3.76               3.76
31   3-Methyl-6-nitroaniline         3.80       3.61               3.62
32   4-Methyl-2-nitroaniline         3.79       3.78               3.78
33   4-Hydroxy-3-nitroaniline        3.65       3.51               3.52
34   4-Methyl-3-nitroaniline         3.77       3.78               3.78
35   1,2,3-Trichlorobenzene          4.89       4.84               4.84
36   1,2,4-Trichlorobenzene          5.00       5.02               5.02
37   1,3,5-Trichlorobenzene          4.74       4.36               4.45
38   2,4-Dichlorophenol              4.30       4.53               4.52
39   3,4-Dichlorotoluene             4.74       4.46               4.48
40   2,4-Dichlorotoluene             4.54       4.57               4.56
41   4-Chloro-3-methylphenol         4.27       4.27               4.27
42   2,4-Dimethylphenol              3.86       3.74               3.76
43   2,6-Dimethylphenol              3.75       3.75               3.75
44   3,4-Dimethylphenol              3.90       3.90               3.90
45   2,4-Dinitrophenol               4.04       4.03               4.04
46   1,2,4-Trimethylbenzene          4.21       4.07               4.09
47   2,3-Dinitrotoluene              5.01       5.29               5.21
48   2,4-Dinitrotoluene              3.75       4.29               4.27
49   2,5-Dinitrotoluene              5.15       4.89               4.93
50   2,6-Dinitrotoluene              3.99       4.43               4.41
51   3,4-Dinitrotoluene              5.08       5.29               5.23
52   3,5-Dinitrotoluene              3.91       4.25               4.23
53   1,3,5-Trinitrobenzene           5.29       5.29               5.29
54   2-Methyl-3,5-dinitroaniline     4.12       4.23               4.22
55   2-Methyl-3,6-dinitroaniline     5.34       4.59               4.64
56   3-Methyl-2,4-dinitroaniline     4.26       3.97               4.00
57   5-Methyl-2,4-dinitroaniline     4.92       3.88               3.97
58   4-Methyl-2,6-dinitroaniline     4.21       4.76               4.72
59   5-Methyl-2,6-dinitroaniline     4.18       4.64               4.61
60   4-Methyl-3,5-dinitroaniline     4.46       4.33               4.34
61   2,4,6-Tribromophenol            4.70       4.98               4.82
62   1,2,3,4-Tetrachlorobenzene      5.43       5.55               5.53
63   1,2,4,5-Tetrachlorobenzene      5.85       5.76               5.77
64   2,4,6-Trichlorophenol           4.33       4.68               4.64
65   2-Methyl-4,6-dinitrophenol      5.00       4.45               4.48
66   2,3,6-Trinitrotoluene           6.37       6.39               6.38
67   2,4,6-Trinitrotoluene           4.88       5.32               5.26
68   2,3,4,5-Tetrachlorophenol       5.72       5.64               5.65
69   2,3,4,5,6-Pentachlorophenol     6.06       6.01               6.03

* CV and FIT values are calculated using Eq. (10).


which was achieved by the orthogonalization of descriptors,
because in the orthogonal basis the computation
of R is much faster and simpler (Lucic et al.,
1995a,b,c; Lucic, 1997). Namely, when one has the
MR model based on the set of I orthogonalized descriptors
d_i (i = 1, ..., I), the correlation coefficient R
between the experimental values of modeled activity A
and the values estimated by the model can be
calculated in a very simple way (Eq. (1)):

R = \left( \sum_{i=1}^{I} R_i^2 \right)^{1/2}    (1)

where R_i is the correlation coefficient between each
orthogonalized descriptor d_i and the modeled activity
A. For example, using this procedure it takes 28 CPU
min on a Hewlett-Packard 9000/E55 computer, which is
configured as a server, to select the best MR model
with five out of 104 descriptors among 10^8 possible
models.
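
The identity behind Eq. (1), and the size of the search space quoted above, can be checked numerically with the sketch below. This is an illustration on synthetic data, not the authors' program: the orthogonalization is done with a QR decomposition of the mean-centered descriptors, which is an assumption standing in for whatever orthogonalization order the cited method uses.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
D = rng.normal(size=(n, k))
D[:, 1] += 0.8 * D[:, 0]                 # make the raw descriptors intercorrelated
y = D @ np.array([1.0, -0.5, 0.3]) + 0.2 * rng.normal(size=n)

# Orthogonalize the mean-centered descriptors (columns of Q are
# mutually orthogonal and have zero mean).
Q, _ = np.linalg.qr(D - D.mean(axis=0))

# Correlation of each orthogonalized descriptor with the activity.
r_i = np.array([np.corrcoef(Q[:, i], y)[0, 1] for i in range(k)])
r_from_sum = math.sqrt(np.sum(r_i ** 2))

# Reference value: correlation between y and an ordinary least-squares fit
# on the original (non-orthogonal) descriptors.
A = np.column_stack([np.ones(n), D])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r_direct = np.corrcoef(A @ coef, y)[0, 1]

# Number of distinct five-descriptor subsets of 104 descriptors.
n_models = math.comb(104, 5)
```

The two values of R agree to machine precision, so scoring a candidate subset needs only the per-descriptor correlations rather than a full regression fit. The subset count is 91,962,520, consistent with the order of magnitude of 10^8 quoted in the text.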


3.  Results and discussion

3.1.  QSAR of benzamidines

The best one-descriptor structure-complement-inhibitory
activity model of benzamidines obtained is:

1/\log C = -0.9332(\pm 0.0229) + 0.4395(\pm 0.0152)\,H^{V}

n = 105, R = 0.943, R_{cv} = 0.941, S = 0.0195, S_{cv} = 0.0199, F = 832    (2)

where H^{V} is the graph-vertex complexity (Basak, 1987),
n is the number of benzamidine derivatives considered,
R is the correlation coefficient, R_{cv} is the leave-one-out
(cross-validated) correlation coefficient, F is the F-value, S
is the standard error and S_{cv} is the cross-validated
(leave-one-out) standard error of estimate (root-mean-square
error), both with N-2 in the denominator. This
model is only slightly better than the earlier obtained
one-descriptor model, which uses a different descriptor
(Basak et al., 1999a):

1/\log C = -0.6428(\pm 0.0129) + 0.0490(\pm 0.0017)\,{}^{3D}W

n = 105, R = 0.943, R_{cv} = 0.940, S = 0.0196, S_{cv} = 0.0200, F = 824    (3)

where {}^{3D}W is the 3-D Wiener number for the hydrogen-suppressed
structures computed using their geometric
distance matrices (Bogdanov et al., 1989). Close to this
model is a model with the 3-D Wiener number computed
for structures containing all atoms including hydrogens
(Bosnjak et al., 1989) (n = 105, R = 0.941, R_{cv} = 0.939,
S = 0.0199, S_{cv} = 0.0203).

The best two-descriptor model of the benzamidine
structure-complement-inhibitory activity is:

1/\log C = -0.6878(\pm 0.0175) + 0.1327(\pm 0.0367)\,W + 0.1864(\pm 0.0380)\,{}^{3D}W

n = 105, R = 0.950, R_{cv} = 0.947, S = 0.0184, S_{cv} = 0.0189, F = 467    (4)

where W is the 2-D Wiener number (Wiener, 1947).
The best three-descriptor model is given by:

1/\log C = -0.6400(\pm 0.0239) + 0.1273(\pm 0.0355)\,W + 0.0103(\pm 0.0037)\,P_9 + 0.1698(\pm 0.0372)\,{}^{3D}W
Table 3

Descriptions of all considered descriptors and symbols of only
those descriptors involved in the models

Symbol       Description
             Information index for the magnitude of distances between all possible pairs of vertices of a graph
             Mean information index for the magnitude of distance
W            Wiener index, the half-sum of the off-diagonal elements of the molecular distance matrix
             Degree complexity
             Graph vertex complexity
             Graph distance complexity
             Information content of the distance matrix partitioned by frequency of occurrences of distance h
             Information content of the hydrogen-suppressed graph at its maximum neighborhood of vertices
             Order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
             A Zagreb group parameter, the sum of square of degree over all vertices
             A Zagreb group parameter, the sum of cross-product of degrees over all neighboring (connected) vertices
IC_r         Mean information content of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC_r        Structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CIC_r        Complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
             Path connectivity index of order h = 0-6
             Cluster connectivity index of order h = 3-6
             Chain connectivity index of order h = 6
             Path-cluster connectivity index of order h = 4-6
             Bond path connectivity index of order h = 0-6
             Bond cluster connectivity index of order h = 3-6
             Bond chain connectivity index of order h = 6
             Bond path-cluster connectivity index of order h = 4-6
             Valence path connectivity index of order h = 0-6
             Valence cluster connectivity index of order h = 3-6
^hχ^v_ch     Valence chain connectivity index of order h = 6
^hχ^v_pc     Valence path-cluster connectivity index of order h = 4-6
P_j          Number of paths of length j = 0-10
             Balaban's J index based on distance
             Balaban's J index based on relative electronegativities
             Balaban's J index based on relative covalent radii
             Balaban's J index based on bond types
             Energy of the highest occupied molecular orbital
             Energy of the second highest occupied molecular orbital
E_LUMO       Energy of the lowest unoccupied molecular orbital
             Energy of the second lowest unoccupied molecular orbital
ΔH_f         Heat of formation
μ            Dipole moment
             van der Waals volume
             3-D Wiener index for the hydrogen-filled geometric distance matrix
             3-D Wiener index for the hydrogen-suppressed geometric distance matrix


n = 105, R = 0.954, R_{cv} = 0.949, S = 0.0177, S_{cv} = 0.0185, F = 335    (5)

where P_9 is the number of paths of length nine. P_9 could be omitted
from Eq. (5) because the error of its regression coefficient
is relatively large compared to the value of the coefficient
itself. Eq. (5) then simply converts into Eq. (4).
The best four-descriptor model is:

1/\log C = -0.6999(\pm 0.0194) + 0.1327(\pm 0.0354)\,W + 5.0332(\pm 1.2285)\,{}^{6}\chi^{b}_{ch} - 5.1120(\pm 1.2486)\,{}^{6}\chi^{v}_{ch} + 0.1885(\pm 0.0359)\,{}^{3D}W

n = 105, R = 0.957, R_{cv} = 0.953, S = 0.0170, S_{cv} = 0.0177, F = 272    (6)

where {}^{6}\chi^{b}_{ch} and {}^{6}\chi^{v}_{ch} denote the bond-chain and valence-chain
connectivity indices of order six, respectively.
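
As a consistency check on the reported statistics, the F-value of a least-squares model can be recovered from R, the number of compounds n and the number of descriptors k. The sketch below assumes the standard overall F formula for a regression with an intercept; the source does not state which formula was used, and the function name is hypothetical.

```python
def f_statistic(r, n, k):
    """Overall F for a least-squares model with k descriptors and an intercept."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Four-descriptor model of Eq. (6): R = 0.957, n = 105, k = 4.
f_eq6 = f_statistic(0.957, 105, 4)
```

For Eq. (6) this gives a value of about 272, in line with the reported F. Because the published R values are rounded to three decimals, differences of a few units can be expected when the same check is applied to the other equations.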

Hansch and Yoshimoto (1974) published, 25 years ago,
the following four-descriptor model for benzamidine
derivatives inhibiting complement (the model is given
in their notation):

\log(1/C) = 0.15(\pm 0.03)(\text{MR-1,2}) + 1.07(\pm 0.13)(\text{D-1}) + 0.52(\pm 0.28)(\text{D-2}) + 0.43(\pm 0.14)(\text{D-3}) + 2.425(\pm 0.12)

n = 108, R = 0.935, S = 0.258    (7)

where MR-1,2 is the molar refractivity of substituents at
positions 1 and 2, taken from the compilation by
Hansch et al. (1973) or computed, while
D-1, D-2, and D-3 are indicator variables for the
presence or absence of three kinds of substructural
units in a given benzamidine. To compare the fitted statistical
parameters of our four-descriptor model (Eq. (6))
with those of the model given by Eq. (7), we retransformed
our results into the log(1/C) scale used by Hansch and
Yoshimoto. Thus, we obtained statistical parameters
(R = 0.941 and S = 0.237) that are comparable with
their result. However, Hansch and Yoshimoto considered
108 benzamidine derivatives and we considered only
105. This discrepancy is caused by problematic
data for three compounds, which in our case are discarded
from the set of benzamidine derivatives (Basak
et al., 1999a). Note also that the nature of the descriptors
used in these two types of models is different. Descriptors used
by us are calculated solely from the structures of the studied
molecules, while the Hansch-Yoshimoto parameters
(molar refractivities of substituents) are experimentally
based.

Finally, the five-descriptor model is:

$$1/\log C = 1.5264(\pm 0.3534) + 0.6323(\pm 0.0936)\,IC_2 - 1.6788(\pm 0.2720)\,IC_6 - 1.4540(\pm 0.2043)\,SIC_1 - 0.4239(\pm 0.0680)\,CIC_6 + 0.1286(\pm 0.0149)\,{}^{3D}W$$

$$n = 105 \quad R = 0.963 \quad R_{cv} = 0.957 \quad S = 0.0158 \quad S_{cv} = 0.0170 \quad F = 253 \qquad (8)$$

where $IC_2$ and $IC_6$ denote the mean information content of structure
based on the second- and sixth-order neighborhood of atoms, including
hydrogens, in the structure, respectively. $SIC_1$ and $CIC_6$ are,
respectively, the structural information content for the first-order
neighborhood and the complementary information content for the sixth-order
neighborhood of atoms, including hydrogens, in the structure. $IC_r$,
$SIC_r$, and $CIC_r$ are molecular complexity indices introduced some time
ago by one of us (Basak, 1987) for use in predictive pharmacology and
toxicology.
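
As a rough illustration of how the information-theoretic indices relate to one another, the sketch below computes IC, SIC, and CIC from the sizes of the atom equivalence classes at a given neighborhood order. It assumes the usual Shannon-entropy definitions (IC = -sum p_i log2 p_i, SIC = IC / log2 n, CIC = log2 n - IC); the function name and the example partition are hypothetical and are not taken from the paper.

```python
import math

def information_indices(class_sizes):
    """Return (IC, SIC, CIC) in bits for one neighborhood order.

    class_sizes: number of atoms in each equivalence class of the
    hydrogen-filled molecular graph (needs more than one atom in total).
    """
    n = sum(class_sizes)
    # Shannon information content of the partition
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    log_n = math.log2(n)
    sic = ic / log_n   # structural information content (normalized IC)
    cic = log_n - ic   # complementary information content
    return ic, sic, cic

# Hypothetical partition: 8 atoms falling into classes of 4, 2 and 2 atoms
ic, sic, cic = information_indices([4, 2, 2])
print(ic, sic, cic)
```

In practice the partition itself comes from comparing the r-th order neighborhoods of all atoms, which is the step that programs such as POLLY automate.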

It is interesting to note that the 3-D Wiener number is present in all
models given here, except in the very best model with a single descriptor,
although it is present in the next best single-descriptor model. This is
not surprising because this descriptor has been shown to be very useful in
structure-property-activity modeling (Bogdanov et al., 1989; Bosnjak et
al., 1991; Mihalic and Trinajstic, 1991; Nikolic et al., 1991; Trinajstic,
1992).

The models containing more descriptors did not outperform the above
five-descriptor model. Thus, the model with five descriptors (Eq. (8)),
selected from the initial set of descriptors, is the best QSAR model,
according to the calculated cross-validated statistical parameters, for
predicting the benzamidine structure-complement-inhibitory activity. This
model is better than the one-descriptor model previously obtained using the
hierarchical approach (Basak et al., 1999a). However, according to
F-values, the one-descriptor models selected in this paper and in our
previous work (Basak et al., 1999a) appear to be better models than the
model with five descriptors. But the F-value is calculated only from the
fitted correlation coefficient R, taking into account the number of
parameters optimized in the model. Because it is accepted (Ortiz et al.,
1997) that the cross-validated statistical parameters give better evidence
of model quality than fitted statistical parameters, our final conclusions
are based on cross-validated statistical parameters, although prediction
for compounds from an external data set would be the best test of model
quality. A plot of the experimental versus predicted values of 1/log C,
calculated in the cross-validation procedure using Eq. (8), is given in
Fig. 2. Computed (fitted and leave-one-out cross-validated) 1/log C values
are given in Table 1.
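
To make the remark about F-values concrete, the following sketch evaluates the standard regression F-ratio from the fitted R, the number of compounds n, and the number of descriptors k. That the authors used exactly this formula is an assumption, although it does reproduce the F-values quoted with Eqs. (6) and (8); the function name is hypothetical.

```python
def f_statistic(r, n, k):
    """Regression F-ratio from fitted R, n compounds and k descriptors.

    Assumes the standard form F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).
    """
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Values quoted with Eq. (6) (four descriptors) and Eq. (8) (five descriptors)
f_eq6 = f_statistic(0.957, 105, 4)
f_eq8 = f_statistic(0.963, 105, 5)
print(round(f_eq6), round(f_eq8))
```

Because k appears in the numerator's denominator, a model with one descriptor can show a larger F than a five-descriptor model with a higher R, which is why F alone favors the one-descriptor models here.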

3.2.  QSAR  of  benzene  derivatives 

The  best  linear  five-descriptor  structure-toxicity 
model  of  benzene  derivatives  selected  by  CROMRsel 
program  is: 

-log(LC5o) 

=  5.2032(  ±  0.546)  +  0.8488(  ±  0. 106)^9 

-h  1 .191%  ±  0. 1 83)"/'p,  -  0.4439(  ±  0.0523)£,,„^ 

-  0, 1379(  ±  0.0195);^  -  0.296 1(  ±  O.milf^W^ 

n = 69   R = 0.921   R_cv = 0.914   S = 0.287   S_cv = 0.312

F=ll  (9) 

where P9 is the path of length nine, $^{4}\chi^{v}_{pc}$ is the valence
path-cluster connectivity index of order four, $E_{LUMO}$ is the energy of
the lowest unoccupied molecular orbital, μ is the dipole moment, and
$^{3D}W_H$ is the 3-D Wiener number for the hydrogen-filled structures
computed using their geometric distance matrices (Bogdanov et al., 1989).
This model has two descriptors fewer than the best model obtained by the
hierarchical approach (see Gute and Basak, 1997) and almost the same
statistical parameters.

The  best  linear  seven-descriptor  model  is: 

—  log(LC5o) 

=  4.4100(  ±  0.809)  +  0.8637(  ±  0.0988)P9 
-f  2.5278(  ±  0.833)“/"  -  3.1248(  ±  0.655)"/" 

-h  1 .5628(  ±  0.372)^/ 'p,  0.44 1 57(  ±  0.05 1 )£,,^o 

-  0.1364(  ±  0.01 8)  -  0.34054(  +  OMlf^W^ 

n = 69   R = 0.940   R_cv = 0.925   S = 0.262   S_cv = 0.291

F=66  (10) 

where $^{2}\chi^{v}$ and $^{4}\chi^{v}$ denote valence path connectivity
indices of order two and four, respectively, and $^{6}\chi^{v}_{pc}$ is the
valence path-cluster connectivity index of order six. Other descriptors are
the same as those from the five-descriptor model (Eq. (9)). This model
($R^2$ = 0.884, F = 66, S = 0.26) is better than the seven-descriptor model
obtained by the hierarchical procedure (see Gute and Basak, 1997)
($R^2$ = 0.863, F = 50, S = 0.30), and one can see that these two models
contain three identical descriptors: P9, $^{3D}W_H$, and μ. Fitted and
cross-validated predicted values for all benzene derivatives obtained using
Eq. (10) are given in Table 2. A plot between the experimental and
predicted values, calculated in the cross-validation procedure using
Eq. (10), of -log(LC50) is given in Fig. 3.

S.C. Basak et al. / Computers & Chemistry 000 (1999) 000-000

Fig. 2. A plot of observed versus calculated (cross-validated) 1/log C
complement-inhibitory activity of benzamidines.

We also found several seven-descriptor linear multiregression models with
better statistical parameters than the best seven-descriptor model of Gute
and Basak (see Gute and Basak, 1997). One of them is very similar to the
model given as Eq. (10), involving the following set of descriptors H*. P9,
n, (see Table 3 for description of descriptors), and possessing the
following statistical parameters: R = 0.9398, R_cv = 0.9245, S = 0.262,
S_cv = 0.292, F = 66.

In addition, we performed modeling in order to compare our seven-descriptor
model with the additivity model (using eight terms, i.e. eight optimized
parameters) derived by Hall et al. (Hall et al., 1984). To do this, we
omitted from the data set compounds 53, 57 and 65, which were identified by
Hall et al. as outliers. For the 66 remaining compounds, the statistical
parameters of the seven-descriptor model (Eq. (10)) are R = 0.955,
R_cv = 0.943, S = 0.225, S_cv = 0.255, F = 87. These parameters are better
than those for the additivity model obtained by Hall et al.
(R = 0.951, S = 0.249, F = 61).
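
The fitted and leave-one-out cross-validated statistics quoted throughout this section can be sketched as follows for the simplest case of a one-descriptor linear model. This is only an illustration: the data are hypothetical, the helper names are invented here, and the error measure is a plain root-mean-square error, which may differ from the exact definition of S and S_cv used in the paper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def loo_predictions(xs, ys):
    """Predict each compound from a model fitted without that compound."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

def pearson_r(u, v):
    """Correlation coefficient between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    return cov / math.sqrt(sum((p - mu) ** 2 for p in u) *
                           sum((q - mv) ** 2 for q in v))

def rms_error(obs, pred):
    """Root-mean-square difference between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Hypothetical descriptor values and activities (not data from the paper)
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

a, b = fit_line(x_demo, y_demo)
fitted = [a + b * x for x in x_demo]
cross_validated = loo_predictions(x_demo, y_demo)

r_fit, r_cv = pearson_r(y_demo, fitted), pearson_r(y_demo, cross_validated)
s_fit, s_cv = rms_error(y_demo, fitted), rms_error(y_demo, cross_validated)
print(r_fit, r_cv, s_fit, s_cv)
```

For least-squares models the leave-one-out error is never smaller than the fitted error, which matches the pattern S_cv > S seen in every model above.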


4.  Concluding  remark 

The presented results show that the optimum way to carry out QSAR modeling
is by selecting the best descriptors in linear (as was the case here) or
nonlinear (Lucic and Trinajstic, 1999) multiregression models.

Abstract

A current trend in quantitative structure-property/activity relationship
(QSPR/QSAR) studies is the use of theoretical molecular descriptors that
can be calculated directly from molecular structure. One advantage of such
descriptors is that they can be calculated for any chemical structure, real
or hypothetical. Topological indices (TIs) or numerical graph invariants
constitute an important subset of these theoretical descriptors. TIs are
derived from different classes of weighted graphs representing various
levels of chemical structural information. They are numerical quantifiers
of molecular topology and encode information about the branching pattern,
cyclicity, and symmetry of molecular graphs. The Wiener index, different
types of connectivity indices, and complexity or information theoretic
topological indices have been widely used in QSAR/QSPR research.

We have been involved in the use of TIs in QSAR/QSPR model development to
estimate pharmacological, physicochemical, and toxicological properties of
diverse sets of molecules. More recently, we have developed a hierarchical
approach in the use of theoretical descriptors where topological,
geometrical, and quantum chemical indices are used. The goal of this
approach has been to use the simplest descriptors first and to only use
more complex descriptors if necessary. For this reason the TIs are divided
into two subsets: a) topostructural indices (TSIs), the topological indices
which are defined on the skeletal molecular graph and which do not
distinguish among the various atoms or bonds present in the molecule, and
b) topochemical indices (TCIs), which explicitly encode information
regarding atom and bond types.

In this chapter we will discuss the utility of TIs, geometrical indices,
and quantum chemical parameters in hierarchical QSAR studies. The results
of studies where the various levels of indices are used in estimating
physicochemical, biological, and toxicological properties of different sets
of molecules will be presented.

I.  INTRODUCTION 

A recent interest in pharmaceutical drug design and hazard assessment of
chemicals is the prediction of environmental, physicochemical,
toxicological, and pharmacological properties of chemicals directly from
their structure [1-11]. Early quantitative structure-activity relationship
(QSAR) studies by Hansch and others used properties and physicochemical
substituent constants for the prediction of complex physicochemical,
biomedicinal, and toxicological properties [12]. Such property-property
correlation is useful only when properties necessary for prediction are
available for all the chemicals under consideration. In the field of
environmental risk assessment, most chemicals do not have the data required
for proper hazard estimation [13]. In


contemporary  drug  design,  one  can  produce  large  (real  or  virtual)  combinatorial  libraries 
of  chemicals  for  screening.  Most  of  these  chemicals  will  have  no  physicochemical  data 
and  predictive  methods  based  on  experimental  data  are  of  no  use  in  this  situation. 
Therefore,  there  is  a  need  for  the  development  of  QSAR  methods  using  nonempirical 
parameters,  i.e.,  parameters  that  can  be  calculated  from  the  molecular  structure. 
Topological indices (TIs), the various molecular size and shape indices, as well as
quantum  chemical  parameters  fall  in  this  category. 

Recently  we  have  developed  a  new  hierarchical  approach  to  QSAR  using 
parameters  which  are  algorithmically  defined,  i.e.,  which  can  be  computed  from 
structure using computer software [14-19]. We have successfully used four classes of
computed parameters, viz., topostructural, topochemical, geometrical, and quantum
chemical parameters, in the development of QSAR models using a hierarchical
approach (vide infra). This approach was found to be quite useful in the estimation of
different  properties. 

In  this  chapter  we  will  review  the  results  of  our  hierarchical  QSAR  studies  pertaining 
to  the  prediction  of  physicochemical,  biological,  and  toxicological  properties  of  different 
groups  of  chemicals. 


II.  CALCULATION  OF  PARAMETERS 
Computation  of  Topological  Indices 

The topological indices used in this study have been calculated by POLLY 2.3 [20], which
calculates  a  total  of  102  indices.  These  indices  include  the  Wiener  index  [21],  the 
connectivity  indices  of  Kier  and  Hall  [2],  and  Randic  [22],  information  theoretic  indices 
defined  on  distance  matrices  of  graphs  [23,24],  a  set  of  parameters  derived  on  the 
neighborhood  complexity  of  vertices  in  hydrogen-filled  molecular  graphs  [25-27],  and 
Balaban’s  J  indices  [28-30].  Table  I  provides  brief  definitions  for  the  indices  included  in 
this  study. 

Computation  of  Geometrical  Indices 

Van der Waals volume, Vw, [31-33] was calculated using Sybyl 6.2 [34]. The 3-D
Wiener numbers [35] were calculated by Sybyl using an SPL (Sybyl Programming
Language) program developed in our laboratory. Calculation of the 3-D Wiener
number consists of summing the entries in the upper triangular submatrix of the
topographic Euclidean distance matrix for a molecule. The 3-D coordinates for
the atoms were determined using CONCORD 3.2.1 [36]. Two variants of the 3-D
Wiener number were calculated: ^3DW_H and ^3DW. For ^3DW_H, hydrogen atoms are
included in the computations, and for ^3DW, hydrogen atoms are excluded from
the computations.
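
The definition above translates almost directly into code. The following is a minimal sketch, not the SPL program used by the authors: it sums the Euclidean distances over all unordered atom pairs, which is the same as summing the upper triangle of the distance matrix. The function name, the atom representation, and the coordinates are hypothetical.

```python
import math
from itertools import combinations

def wiener_3d(atoms, include_hydrogens=True):
    """Sum of Euclidean distances over all unordered atom pairs.

    atoms: list of (element_symbol, (x, y, z)) tuples.
    """
    coords = [xyz for element, xyz in atoms
              if include_hydrogens or element != "H"]
    # Each unordered pair corresponds to one upper-triangle matrix entry
    return sum(math.dist(p, q) for p, q in combinations(coords, 2))

# Hypothetical three-atom fragment with made-up coordinates
fragment = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("H", (0.0, 1.0, 0.0)),
]

w_with_h = wiener_3d(fragment, include_hydrogens=True)      # hydrogen-filled
w_without_h = wiener_3d(fragment, include_hydrogens=False)  # hydrogen-suppressed
print(w_with_h, w_without_h)
```

The two calls correspond to the hydrogen-included and hydrogen-excluded variants described in the text.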

Computation  of  Quantum  Chemical  Parameters 

The quantum chemical parameters E_HOMO, E_HOMO1, E_LUMO, E_LUMO1, ΔH_f, and μ were
calculated for all of the following semi-empirical Hamiltonians: AM1, PM3, MNDO,
MINDO/3. These parameters were calculated by MOPAC 6.00 in the SYBYL interface
[37]. One difficulty was encountered in using the MINDO/3 Hamiltonian.
Data Reduction and Division of the Topological Indices

Initially,  all  TIs  were  transformed  by  the  natural  logarithm  of  the  index  plus  one.  This 
was  done  since  the  scale  of  some  indices  may  be  several  orders  of  magnitude  greater 
than that of other indices and other indices may equal zero. The geometric indices were
transformed by the natural logarithm of the index for consistency; the addition of one
was unnecessary.
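
The two scalings described above can be sketched as below. The function names and sample values are hypothetical; the point is only that ln(x + 1) stays defined when an index equals zero, whereas the geometric indices are strictly positive and need no offset.

```python
import math

def transform_topological(value):
    """ln(index + 1): compresses large indices and is defined at zero."""
    return math.log(value + 1.0)

def transform_geometric(value):
    """ln(index): geometric indices are positive, so no offset is added."""
    return math.log(value)

# Hypothetical raw values: a large Wiener-type index and a zero-valued index
scaled = [transform_topological(v) for v in (1200.0, 0.0)]
volume_scaled = transform_geometric(85.0)
print(scaled, volume_scaled)
```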

The  set  of  TIs  was  partitioned  into  two  distinct  sets:  topostructural  indices  and 
topochemical  indices.  Topostructural  indices  are  indices  which  encode  information 
about the adjacency and distances of atoms (vertices) in molecular structures (graphs)
irrespective of the chemical nature of the atoms involved in the bonding or factors like
hybridization  states  of  atoms  and  number  of  core/valence  electrons  in  individual  atoms. 
Topochemical  indices  are  parameters  which  quantify  information  regarding  the  topology 
(connectivity  of  atoms)  as  well  as  specific  chemical  properties  of  the  atoms  comprising  a 
molecule.  Topochemical  indices  are  derived  from  weighted  molecular  graphs  where 
each  vertex  (atom)  is  properly  weighted  with  selected  chemical/physical  properties. 

These  sets  of  the  indices  are  shown  in  Table  I. 

To  reduce  the  number  of  independent  variables  that  were  used  for  model 
construction  in  the  smaller  sets  of  compounds,  the  sets  of  topostructural  and 
topochemical  indices  were  further  divided  into  subsets,  or  clusters,  based  on  the 
correlation  matrix  using  the  SAS  procedure  VARCLUS  [38].  The  VARCLUS  procedure 
divides  the  set  of  indices  into  disjoint  clusters  so  that  each  cluster  is  essentially 
unidimensional.  From  each  cluster  we  select  the  index  most  correlated  with  the  cluster, 
as  well  as  any  indices  which  are  poorly  correlated  with  the  cluster  (r  <  0.70).  These 
indices  are  then  used  in  model  construction.  The  variable  clustering  and  selection  of 
indices is performed independently for both the topostructural and topochemical
subsets.


III. DEVELOPMENT OF HIERARCHICAL QSAR MODELS

In the development of hierarchical QSAR models, between two and four sets of indices
have been used. A schematic of this method is given in Figure 1. The SAS procedure
REG is used to conduct the all-subsets regression analyses [38]. Final model selection
from the all-subsets regression is based on the results for both RSQUARE and CP
(Mallows' Cp statistic). The hierarchy begins with the simplest indices, the
topostructural indices. After developing our initial model utilizing the topostructural
indices, the level of complexity is increased one step. To the indices included in the
best topostructural model, all of the topochemical indices are added and modeling is
conducted using the combined set of parameters. Likewise, the indices included in the
best model from this procedure are combined with the geometrical indices and modeling
is conducted once again. Finally, in some studies we have included quantum chemical
parameters calculated by MOPAC. These parameters are added to the best model
selected from modeling with the combination of topostructural, topochemical, and
geometrical parameters, and all-subsets regression is used to find the best-fit model. In
some of our studies we have also used each level of the hierarchy individually, to
compare the results of using only one higher-level set (e.g., geometrical indices) alone
and to determine the degree of contribution to modeling from the given set. Thus, there
may be as many as seven final models in a hierarchical study: models illustrating the
individual contributions of the three higher-level sets of indices, as well as the four
models from the stepwise procedure of the hierarchical modeling.

[Insert Figure 1 here]
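
As an illustration of the CP criterion mentioned above, the sketch below evaluates Mallows' Cp for a candidate subset model, using the usual definition Cp = SSE_p / s^2 - (n - 2p), where s^2 is the mean squared error of the full model and p counts the fitted parameters including the intercept. The function name and all numbers are hypothetical; this is not the SAS implementation.

```python
def mallows_cp(sse_subset, n_params_subset, mse_full, n_obs):
    """Mallows' Cp for a subset model.

    sse_subset: residual sum of squares of the subset model
    n_params_subset: fitted parameters in the subset model (with intercept)
    mse_full: residual mean square of the model containing all candidates
    n_obs: number of observations
    """
    return sse_subset / mse_full - (n_obs - 2 * n_params_subset)

# Hypothetical example: 50 compounds, full model with 6 parameters
n_obs = 50
sse_full = 88.0
mse_full = sse_full / (n_obs - 6)

cp_full = mallows_cp(sse_full, 6, mse_full, n_obs)   # equals 6 by construction
cp_good = mallows_cp(92.0, 4, mse_full, n_obs)       # Cp close to p: little bias
cp_poor = mallows_cp(140.0, 3, mse_full, n_obs)      # Cp well above p: biased
print(cp_full, cp_good, cp_poor)
```

A subset is usually considered acceptable when Cp is small and close to its own parameter count, which is how Cp complements R-squared when choosing among models of different size.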

IV.  HIERARCHICAL  QSAR/QSPR  STUDIES 

The hierarchical method has been used in developing QSAR models for predicting a
wide variety of properties. The following are examples from our previous studies
employing the hierarchical approach in the construction of useful models.

Physicochemical  Properties 

Three  large  sets  of  chemicals  have  been  used  to  model  physicochemical  properties, 
viz.,  normal  boiling  point,  lipophilicity  (logP),  and  normal  vapor  pressure.  The  normal 
boiling  point  data  was  a  subset  of  the  Toxic  Substances  Control  Act  (TSCA)  Inventory 
[13] for which measured normal boiling point data were available and where HB1, a
simple measure of the hydrogen bonding potential of a chemical, was equal to zero.
This  resulted  in  a  set  of  1023  diverse  chemicals  [14].  For  this  particular  set,  only  the  first 
three  levels  of  the  hierarchical  approach  were  used,  mainly  due  to  the  large  amount  of 
computational  time  necessary  to  generate  quantum  chemical  parameters  for  a  set  of 
over  1000  chemicals.  Eight  topostructural  indices  were  selected  for  the  first  model  (Eq. 
1).  The  second  level  of  the  hierarchy  resulted  in  the  retention  of  two  of  those 
topostructural  indices  and  the  addition  of  six  topochemical  indices  (Eq.  2).  Finally,  the 
addition  of  geometric  indices  resulted  in  a  ten  parameter  model  using  the  two 
topostructural  indices,  the  six  topochemical  indices,  and  two  of  the  geometric  indices 
(Eq. 3). The results of this modeling are presented below (Eqs. 1-3):

BP = -21.9 + 30.6(W) - 21.5(0) + 69.9(^x) + 35.8(®x) - 106.5(®xc) - 96.1(®xch)
- 17.7(Vc) + 19.5(P10)   Eq. 1

n = 1023, r² = 0.812, s = 39.7°C, F = 547


BP = -332.9 + 134.6(®x) + 10.9(P10) + 110.0(IC0) - 133.8(V) - 80.2(Vc)

+ 176.5(V) + 44.8(V) + 16.8(Vpc)   Eq. 2

n = 1023, r² = 0.961, s = 18.0°C, F = 3151

BP = -285.7 + 125.3(®x) + 10.6(P10) + 74.5(IC0) - 125.0(V) - 86.3(Vc)

+ 175.3(V) + 49.1(V) + 18.7(Vpc) - 9.1(^3DW_H) + 8.1(^3DW)   Eq. 3
n = 1023, r² = 0.963, s = 17.6°C, F = 2650


From  the  three  equations  presented,  it  is  clear  that  the  replacement  of  six  topostructural 
indices  with  six  topochemical  indices  greatly  enhanced  the  predictive  power  of  the 
model,  while  the  addition  of  the  geometric  parameters  did  not  add  much  to  the  model.  A 
scatterplot of experimental versus predicted boiling point from equation 3 is shown in
Figure 2.

[Insert  Figure  2  here] 
The lipophilicity data are a subset of 219 chemicals derived from the STARLIST set,
with logP values between -2 and 5.5 obtained from CLOGP [39] and HB1 equal to zero
[14]. This subset was chosen to examine the effectiveness of models based on
topological indices in the prediction of lipophilicity for compounds that do not have
explicit hydrogen-bonding centers. Compounds were chosen within the range of logP
values described to avoid the problematic nature of compounds having exceptionally
high values for lipophilicity. As with the boiling point models, only the first three levels of
the hierarchy were applied to modeling lipophilicity. Seven topostructural indices were
initially selected (Eq. 4), and again, only two were retained with the addition of eight
topochemical indices (Eq. 5). In equation 6, with the addition of two geometric
parameters, an additional topostructural index is removed from the model. These
equations (4-6) are presented below:

logP  =  -1.42  +  1.08(W)  -  1.58(^x)  +  -  0.92(®xc)  -  0.32(P7)  + 


+  1.97(J) 


n = 219, r² = 0.789, s = 0.54, F = 112


logP = -2.13 - 0.20(\) + 0.18(P10) - 1.86(IC0) + 1.33(CIC2) - 0.92(CIC3)

-  1  36(V)  +  5.76(V)  -  l.QSCx')  +  0.54(V)  “  0.39(Vc)  Eq.  5 

n = 219, r² = 0.908, s = 0.36, F = 206

logP  =  -5.60  +  0.1 9(Pio)  -  1 .46(lCo)  +  1 .09(CIC2)  -  0.77(CIC3)  -  1  -36(  x  ) 

+  5  -  3.41(V'')  +  0.55(V)  "  0.41(Vc)  +  1  -lOCVw)  -  0.17(^°W)  Eq,  6 

n = 219, r² = 0.912, s = 0.35, F = 194

These three equations show results similar to those for the modeling of normal boiling
point. The replacement of topostructural indices with an equal or greater number of
topochemical indices results in marked improvement in the predictive power of the
model, while the addition of geometric indices resulted in only a minor improvement.
Figure 3 presents a plot of the experimental logP values versus those
predicted from equation 6. The 219 chemicals and their observed and predicted values
for logP have been presented previously in the literature [14].

[Insert  Figure  3  here] 

The 476 chemicals in the normal vapor pressure data [16] are a subset of an
inventory taken from the ASTER (Assessment Tools for the Evaluation of Risk)
database [40]. This is a diverse subset of chemicals, all of which have vapor pressure
(p_vap) data measured at 25°C and ranging between 3 and 10,000 mmHg.

The first three levels of the hierarchical method have been employed; however, the
addition of geometric parameters to the modeling process did not result in the selection
of a novel model, and so there is no geometric model reported.

Iog,o(p.«,)  =  4.88  0.20(0)  -  2.56('x)  +  0.49(‘'xc)  +  0.79(V)  +  0.98(P,o)  Eq.  7 

n = 476, r² = 0.515, s = 0.53, F = 99.7

log,o(Pv«,)  =  8.44  -  1.77(’x)  +  1.25(P,o)  -  5.69(IC,)  i-  3.91(102)  - 
*1.41(Yc)-1.70('x'') 
n = 476, r² = 0.793, s = 0.34, F = 224.0


As  can  be  seen  from  equation  7,  five  topostructural  indices  were  initially  selected  to 
model  normal  vapor  pressure.  The  addition  of  the  topochemical  indices  resulted  in  the 
retention  of  two  topostructural  indices  and  the  addition  of  five  topochemical  indices  (Eq. 
8).  As  was  seen  for  the  other  two  physicochemical  properties,  viz.,  normal  boiling  point 
and  lipophilicity,  the  predictive  power  of  the  model  is  greatly  enhanced  by  the  addition  of 
the  topochemical  indices.  A  scatterplot  of  experimental  versus  predicted  normal  vapor 
pressure, based on equation 8, is shown in Figure 4. These results are adequate;
however, as can be seen from Figure 5, while the residuals show fairly uniform scatter
when plotted against the dependent variable, there are some significant outliers, and the
data tend to be somewhat skewed toward the lower end of the vapor pressure range.

[Insert  Figure  4  here] 

[Insert  Figure  5  here] 

Biological  Properties 

Two  smaller  sets  of  congeneric  chemicals  have  been  used  in  the  study  of  biological 
properties.  The  smaller  of  the  two  sets  [19]  consisted  of  sixty  polycyclic  aromatic 
hydrocarbons  for  which  24-hour  dermal  penetration  (DP)  data  were  available  from  the 
work  of  Roy  et  al  [41].  For  the  purposes  of  this  study,  all  four  levels  of  the  hierarchical 
method  were  employed.  Only  two  equations  are  being  presented  since  the  addition  of 
geometric  and  quantum  chemical  parameters  to  the  modeling  procedure  did  not  result 
in  the  formulation  of  improved  QSAR  equations. 

DP = 224.1 - 67.9(P0)   Eq. 9

n = 60, r² = 0.675, s = 7.4, F = 120.6

DP = 179.7 - 78.8(\‘’)   Eq. 10

n = 60, r² = 0.695, s = 7.1, F = 132.0


Equation  9  shows  the  model  resulting  from  the  topostructural  modeling.  A  one 
parameter  model  which  explains  67.5%  of  the  variance  was  generated.  A  small 
improvement  is  seen  in  the  model  resulting  from  the  addition  of  the  topochemical 
indices (Eq. 10), in which the topostructural index is replaced by the topochemical
index. Figure 6 presents a scatterplot of experimental dermal penetration versus the
predicted  results  from  equation  10. 

[Insert  Figure  6  here] 

The  second  set  of  biological  data  studied  using  the  hierarchical  method  was  a  set  of 
107  benzamidines  [18]  that  act  as  inhibitors  of  the  complement  system,  collected  from 
the literature by Hansch and Yoshimoto [42]. The base structure for the benzamidines is
presented in Figure 7, and the side-chains and activity values have been published
previously  [18].  The  large  size  of  these  molecules  made  the  calculation  of  quantum 
chemical  indices  prohibitively  time  consuming.  As  a  result,  the  first  three  levels  of  the 
hierarchical  modeling  procedure  were  used  for  this  study. 

[Insert  Figure  7  here] 

1/log C = 1.1245 + 0.4989(1°)   Eq. 11

n = 105, r² = 0.884, s = 0.0200, F = 785

1/log C = -0.6428 + 0.0490(^3DW)   Eq. 12

n = 105, r² = 0.889, s = 0.0196, F = 824

A  single  topostructural  index  provided  a  strong  correlation  with  the  inhibitory  activity  of 
these  large  compounds  (Eq.  11).  This  one  index  modeled  the  activity  so  well,  that  the 
addition  of  topochemical  indices  did  not  add  significantly  to  the  predictive  power  of  the 
model.  Finally,  with  the  addition  of  geometric  parameters  to  the  modeling  of  inhibitory 
activity,  it  was  found  that  one  geometric  parameter  provided  a  slightly  better  correlation 
with  activity  than  did  the  topostructural  index  (Eq.  12),  explaining  89%  of  the  variance  in 
the  data.  The  results  of  this  final  model  (Eq.  12)  are  shown  in  Figure  8  as  a  scatterplot 
of  experimental  versus  predicted  activity. 

[Insert  Figure  8  here] 

Toxicological  Properties 

Two  sets  of  compounds  have  been  studied  using  the  hierarchical  modeling  for 
toxicological  properties.  The  first  set  consists  of  acute  aquatic  toxicity  data  for  69 
benzene  derivatives  determined  by  the  96-hour  fathead  minnow  toxicity  test  system 
[17].  This  data  was  compiled  by  Hall,  Kier,  and  Phipps  [43]  from  eight  literature  sources 
and  was  supplemented  by  some  original  work  conducted  at  the  U.S.  Environmental 
Protection  Agency  (USEPA)  Environmental  Research  Laboratory  in  Duluth,  Minnesota. 

LC50 = -7.50 + 3.50(M1) - 1.72(IC) - 0.52(P8) + 0.68(P9)   Eq. 13

n = 69, r² = 0.453, s = 0.58, F = 13.3

LC50 = 23.68 + 5.04(M1) + 0.55(P9) - 43.27(SIC0) - 20.04(CIC0)   Eq. 14

n = 69, r² = 0.783, s = 0.36, F = 57.9

LC50 = 0.59 + 5.82(M1) + 0.55(P9) - 14.23(SIC0) - 2.36(^3DW_H)   Eq. 15

n = 69, r² = 0.792, s = 0.36, F = 61.1

LC50 = -3.83 + 5.97(M1) + 0.77(P9) - 8.26(SIC0) - 1.98(^3DW_H) + 0.41(E_LUMO1)
+ 0.01(ΔH_f) - 0.12(μ)   Eq. 16

n = 69, r² = 0.863, s = 0.30, F = 55.0


Equation  13  shows  the  results  of  the  initial  modeling  using  topostructural  indices.  Even 
using  four  indices,  the  topostructural  set  did  a  poor  job  of  modeling  acute  toxicity.  The 
addition  of  topochemical  indices  led  to  a  significant  improvement  in  predictive  power, 
with  the  replacement  of  two  topostructural  indices  with  topochemical  indices  (Eq.  14). 
The geometrical indices slightly improved the QSAR modeling (Eq. 15); however, it was
the  addition  of  quantum  chemical  indices  which  drastically  improved  the  predictive 
power  of  our  model  (Eq.  16).  The  addition  of  quantum  chemical  indices  increased  the 
variance  explained  by  7.1%  over  the  model  including  geometrical  indices,  resulting  in  an 
overall  explanation  of  86.3%  of  the  variance.  Figure  9  presents  the  scatterplot  of 
experimental  versus  predicted  toxicity  for  these  69  compounds  based  on  the  results  of 
equation  16. 
[Insert Figure 9 here]

A  set  of  520  compounds,  260  mutagens  and  260  non-mutagens,  was  taken  from  the 
literature  [44]  as  a  source  of  mutagenicity  data.  These  data  provided  qualitative 
assessments  of  mutagenicity  based  on  a  positive  or  negative  result  in  the  Ames’ 
mutagenicity  assay.  A  discriminant  function  analysis  (DFA)  was  conducted  on  this  set 
using  the  SAS  procedure  DISCRIM  [38]  to  create  a  function  capable  of  classifying  the 
compounds  as  active  or  inactive.  Based  on  the  results  of  a  previous  study  and  the 
amount  of  time  required  for  the  calculations,  the  quantum  chemical  parameters  were 
excluded  and  indicators  of  molecular  fragments  associated  with  mutagenic  activity  were 
included  [15].  See  the  original  manuscript  for  a  further  discussion  of  the  data  used  in 
this  study  and  the  molecular  fragments  keyed  for  the  analysis.  These  classification 
results,  the  indices  used  in  each  case,  and  brief  notes  on  the  fragment  groups  included 
in  the  final  models  are  presented  in  Table  II. 

[Insert  Table  II  here] 

As  can  be  seen  in  Table  II,  the  topostructural  indices  alone  correctly  classify  over  75% 
of  the  mutagens  (76.2%);  however,  they  correctly  classify  only  57.3%  of  the  non-mutagens, 
leaving  over  40%  of  the  non-mutagens  incorrectly  classified.  The  combination  of 
topostructural  and  topochemical  indices  results  in  a  comparable  classification  rate  for 
mutagens  (74.6%)  and  an  increase  of  5.8  percentage  points  (to  63.1%)  in  the  classification 
of  non-mutagens.  The  addition  of  information  regarding  the  presence  or  absence  of  known 
structural  fragments  associated  with  mutagenic  activity  lowers  the  classification  rate 
for  mutagens  by  5.4  percentage  points,  from  74.6%  to  69.2%.  However,  the 
addition  of  these  structural  fragments  also  raises  the  correct  classification  rate  for 
non-mutagens  from  63.1%  to  71.9%,  an  increase  of  8.8  percentage  points.  As  a 
result  of  this  marked  increase  in  classification  rate  for  non-mutagens,  this  model  was 
retained  and  supplemented  by  the  geometrical  indices.  Addition  of  the  geometrical  indices 
brought  the  classification  rate  for  mutagens  up  to  71.5%  (an  overall  decrease  of  4.7 
percentage  points  from  the  topostructural  model)  and  retained  the  classification  rate  for 
non-mutagens  at  71.9%  (an  overall  increase  of  14.6  percentage  points  over  the  initial 
model).  While  these  results  are  by  no  means  spectacular,  the  final  model  is  reasonably 
accurate  for  the  prediction  of  mutagenic  activity. 
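The step-to-step changes quoted above can be rechecked directly from the Table II rates. The short sketch below does only that arithmetic; the dictionary keys are shorthand labels chosen here, not names from the original analysis. Note that the change from 63.1% to 71.9% works out to 8.8 percentage points.

```python
# Correct-classification rates (%) as reported in Table II:
# (mutagens correct, non-mutagens correct) for each successive model.
rates = {
    "topostructural": (76.2, 57.3),
    "+ topochemical": (74.6, 63.1),
    "+ fragments": (69.2, 71.9),
    "+ geometrical": (71.5, 71.9),
}


def stepwise_changes(table):
    """Change in percentage points from each model to the next."""
    names = list(table)
    out = []
    for prev, cur in zip(names, names[1:]):
        d_mut = round(table[cur][0] - table[prev][0], 1)
        d_non = round(table[cur][1] - table[prev][1], 1)
        out.append((cur, d_mut, d_non))
    return out


steps = stepwise_changes(rates)
first, last = rates["topostructural"], rates["+ geometrical"]
overall = (round(last[0] - first[0], 1), round(last[1] - first[1], 1))

for name, d_mut, d_non in steps:
    print(f"{name:16s} mutagens {d_mut:+.1f}  non-mutagens {d_non:+.1f}")
print("overall", overall)
```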


V.  DISCUSSION 

The  goal  of  hierarchical  QSAR  studies  is  to  investigate  the  relative  roles  of  different 
classes  of  parameters,  viz.,  topostructural  and  topochemical  indices,  3-D  parameters 
and  calculated  quantum  chemical  parameters  in  predicting  different  types  of  molecular 
properties.  It  is  clear  from  the  results  presented  here  that  topostructural  and 
topochemical  indices  explain  most  of  the  variance  in  the  data  for  physicochemical, 
biological  and  toxicological  properties.  In  most  cases  geometrical  and  quantum 
chemical  indices  make  only  marginal  improvements  in  the  predictive  power  of  the 
models.  This  indicates  that  the  easily  calculable  topostructural  and  topochemical  indices 
will  be  an  effective  first  choice  in  QSAR  studies. 

It  is  evident  from  these  studies  that  the  expanded  levels  of  the  hierarchical  method 
are  extremely  useful  for  large,  diverse  sets  of  chemicals  where  there  are  many  factors 
influencing  the  variation  of  properties  between  chemical  structures.  They  are  also  useful 
in  modeling  the  more  complex  biological  interactions  involving  the  modulation  of 
toxicants.  It  is  interesting  to  note  that  studies  involving  the  inhibition  of  a  specific 
enzymatic  system  or  the  passage  of  large  compounds  through  the  skin  are  modeled 
well  using  simply  shape  and  size  descriptors,  and  do  not  seem  to  benefit  significantly 
from  the  addition  of  more  complex  indices.  There  is  still  a  need  for  better  descriptors 
that  will  help  us  to  more  accurately  model  complex  biological  and  toxicological  systems. 

ACKNOWLEDGEMENT 

This  paper  is  contribution  number  236  from  the  Center  for  Water  and  the  Environment 
of  the  Natural  Resources  Research  Institute.  Research  reported  in  this  paper  was 
supported,  in  part,  by  grants  F49620-94-1-0401  and  F49620-96-1-0330  from  the  United 
States  Air  Force. 


REFERENCES 

[1]  Randic,  M.  Nonempirical  Approaches  to  Structure-Activity  Studies.  Int.  J.  Quantum 
Chem:  Quant.  Biol.  Symp.  1984,  11,  137-153. 

[2]  Kier,  L.  B.;  Hall,  L.  H.  Molecular  Connectivity  in  Structure-Activity  Analysis.  Research 
Studies  Press:  Letchworth,  Hertfordshire,  U.K,  1986. 

[3]  Rouvray,  D.  H.;  Pandey,  R.  B.  The  Fractal  Nature,  Graph  Invariants,  and 
Physicochemical  Properties  of  Normal  Alkanes.  J.  Chem.  Phys.  1986,  85,  2286-2290. 

[4]  Basak,  S.  C.  Use  of  Molecular  Complexity  Indices  in  Predictive  Pharmacology  and 
Toxicology:  A  QSAR  Approach.  Med.  Sci.  Res.  1987,  15,  605-609. 

[5]  Basak,  S.  C.;  Frane,  C.  M.;  Rosen,  M.  E.;  Magnuson,  V.  R.  Molecular  Topology  and 
Acute  Toxicity:  A  QSAR  Study  of  Monoketones.  Med.  Sci.  Res.  1987,  15,  887-888. 

[6]  Basak,  S.  C.  Binding  of  Barbiturates  to  Cytochrome  P450:  A  QSAR  Study  Using  Log 
P  and  Topological  Indices.  Med.  Sci.  Res.  1988,  16,  281-282. 

[7]  Basak,  S.  C.  In  Practical  Applications  of  Quantitative  Structure-Activity  Relationships 
(QSAR)  in  Environmental  Chemistry  and  Toxicology;  [...], 
Eds.;  Kluwer  Academic:  Dordrecht/Boston/London,  1990;  pp.  83-103. 

[8]  Basak,  S.  C.;  Niemi,  G.  J.;  Veith,  G.  D.  Predicting  Properties  of  Molecules  Using 
Graph  Invariants.  J.  Math.  Chem.  1991,  7,  243-272. 

[9]  Balaban,  A.  T.;  Basak,  S.  C.;  Colburn,  T.;  Grunwald,  G.  Correlation  Between 
Structure  and  Normal  Boiling  Points  of  Haloalkanes  C1-C4  using  Neural  Networks.  J. 
Chem.  Inf.  Comput.  Sci.  1994,  34,  1118-1121. 




[10]  Basak,  S.  C.;  Grunwald,  G.  D.  In  Proceedings  of  the  XVI  International  Cancer 
Congress,  R.  S.  Rao,  M.  G.  Deo,  L.  D.  Sanghui,  Eds.;  Monduzzi:  Bologna,  Italy,  1995, 
pp.  413-416. 

[11]  Basak,  S.  C.;  Grunwald,  G.  D.;  Niemi,  G.  J.  Use  of  Graph-Theoretic  and 
Geometrical  Molecular  Descriptors  in  Structure-Activity  Relationships.  In  From  Chemical 
Topology  to  Three  Dimensional  Molecular  Geometry,  Balaban,  A.  T.,  Ed.;  Plenum 
Press:  New  York,  1997;  pp  73-116. 

[12]  Hansch,  C.  and  Leo,  A.  (1995).  Exploring  QSAR:  Fundamentals  and  Applications 
in  Chemistry  and  Biology.  American  Chemical  Society,  Washington,  D.C.,  p.  557. 

[13]  Auer,  C.M.,  Nabholz,  J.V.,  and  Baetcke,  K.P.  (1990).  Mode  of  action  and  the 
assessment  of  chemical  hazards  in  the  presence  of  limited  data:  Use  of  structure- 
activity  relationships  (SAR)  under  TSCA,  Section  5.  Environ.  Health  Perspect.  87,  183- 
197. 

[14]  Basak,  S.C.,  Gute,  B.D.,  and  Grunwald,  G.D.  (1996).  A  comparative  study  of 
topological  and  geometrical  parameters  in  estimating  normal  boiling  point  and 
octanol/water  partition  coefficient.  J.  Chem.  Inf.  Comput.  Sci.  36,  1054-1060. 

[15]  Basak,  S.C.  and  Grunwald,  G.D.  (1995).  Predicting  genotoxicity  of  chemicals 
using  nonempirical  parameters.  In,  Proceedings  of  the  XVI  International  Cancer 
Congress  (R.S.  Rao,  M.G.  Deo,  and  L.D.  Sanghui,  Eds.).  Monduzzi,  Bologna,  Italy,  Vol. 
7,  pp  413-416. 

[16]  Basak,  S.C.,  Gute,  B.D.,  and  Grunwald,  G.D.  (1997).  Use  of  topostructural, 
topochemical  and  geometric  parameters  in  the  prediction  of  vapor  pressure:  A 
hierarchical  QSAR  approach.  J.  Chem.  Inf.  Comput.  Sci.  37,  651-655. 

[17]  Gute,  B.D.  and  Basak,  S.C.  (1997).  Predicting  acute  toxicity  (LC50)  of  benzene 
derivatives  using  theoretical  molecular  descriptors:  A  hierarchical  QSAR  approach.  SAR 
QSAR  Environ.  Res.  7, 117-131. 

[18]  Basak,  S.C.,  Gute,  B.D.,  and  Grunwald,  G.D.  (1998).  Prediction  of  complement- 
inhibitory  activity  of  benzamidines  using  topological  and  geometric  parameters.  J. 
Chem.  Inf.  Comput.  Sci.  In  press. 

[19]  Gute,  B.D.,  Grunwald,  G.D.,  and  Basak,  S.C.  (1998).  Prediction  of  the  dermal 
penetration  of  polycyclic  aromatic  hydrocarbons  (PAHs):  A  hierarchical  QSAR 
approach.  SAR  QSAR  Environ.  Res.  In  press. 

[20]  Basak,  S.C.,  Harriss,  D.K.,  and  Magnuson,  V.R.  (1988).  POLLY  2.3:  Copyright  of 
the  University  of  Minnesota. 




[21]  Wiener,  H.  (1947).  Structural  determination  of  paraffin  boiling  points.  J.  Am. 
Chem.  Soc.  69,  17-20. 

[22]  Randic,  M.  (1975).  On  characterization  of  molecular  branching.  J.  Am.  Chem. 
Soc.  97,  6609-6615. 

[23]  Raychaudhury,  C.,  Ray,  S.K.,  Ghosh,  J.J.,  Roy,  A.B.,  and  Basak,  S.C.  (1984). 
Discrimination  of  isomeric  structures  using  information  theoretic  topological  indices.  J. 
Comput.  Chem.  5,  581-588. 

[24]  Bonchev,  D.,  and  Trinajstic,  N.  (1977).  Information  theory,  distance  matrix  and 
molecular  branching.  J.  Chem.  Phys.  67,  4517-4533. 

[25]  Basak,  S.C.,  Roy,  A.B.,  and  Ghosh,  J.J.  (1980).  Study  of  the  structure-function 
relationship  of  pharmacological  and  toxicological  agents  using  information  theory.  In 
Proceedings  of  the  Second  International  Conference  on  Mathematical  Modelling  (X.J.R. 
Avula,  R.  Bellman,  Y.L.  Luke  and  A.K.  Rigler,  Eds.).  University  of  Missouri  -  Rolla, 
pp.  851-856. 

[26]  Basak,  S.C.  and  Magnuson,  V.R.  (1983).  Molecular  topology  and  narcosis:  A 
quantitative  structure-activity  relationship  (QSAR)  study  of  alcohols  using 
complementary  information  content  (CIC).  Arzneim.  Forsch.  33,  501-503. 

[27]  Roy,  A.B.,  Basak,  S.C.,  Harriss,  D.K.,  and  Magnuson,  V.R.  (1984).  Neighborhood 
complexities  and  symmetry  of  [...] 

[28]  Balaban,  A.T.  (1982).  Highly  discriminating  distance-based  topological  index. 
Chem.  Phys.  Lett.  89,  399-404. 

[29]  Balaban,  A.T.  (1983).  Topological  indices  based  on  topological  distances  in 
molecular  graphs.  Pure  and  Appl.  Chem.  55,  199-206. 

[30]  Balaban,  A.T.  (1986).  Chemical  graphs.  Part  48.  Topological  index  J  for 
heteroatom-containing  molecules  taking  into  account  periodicities  of  element  properties. 
Math.  Chem.  (MATCH)  21,  115-122. 

[31]  Bondi,  A.  (1964).  Van  der  Waals  volumes  and  radii.  J.  Phys.  Chem.  68,  441-451. 

[32]  [...]  Bull.  24,  1799-1806. 

[33]  Moriguchi,  I.,  and  Kanada,  Y.  (1977).  Use  of  van  der  Waals  volume  in  structure- 
activity  studies.  Chem.  Pharm.  Bull.  25,  926-935. 




[34]  SYBYL  Version  6.1.  (1994).  Tripos  Associates,  Inc.:  St.  Louis,  MO. 

[35]  Mekenyan,  O.,  Peitchev,  D.,  Bonchev,  D.,  Trinajstic,  N.,  and  Bangov,  I.  (1986). 
Modelling  the  interaction  of  small  organic  molecules  with  biomacromolecules.  I. 
Interaction  of  substituted  pyridines  with  anti-3-azopyridine  antibody. 
Arzneim.-Forsch./Drug  Research  36,  176-183. 

[36]  CONCORD  Version  3.0.1.  (1993).  Tripos  Associates,  Inc.:  St.  Louis,  MO. 

[37]  Stewart,  J.J.P.  (1990).  MOPAC  Version  6.00.  QCPE  #455.  Frank  J  Seiler  Research 
Laboratory:  US  Air  Force  Academy,  CO. 

[38]  SAS  Institute  Inc.  (1988).  In  SAS/STAT  User's  Guide,  Release  6.03  Edition.  SAS 
Institute  Inc.:  Cary,  NC. 

[39]  Leo,  A.  and  Weininger,  D.  (1984).  CLOGP  Version  3.2  User  Reference  Manual. 
Medicinal  Chemistry  Project,  Pomona  College,  Claremont,  CA. 

[40]  Russom,  C.L.;  Anderson,  E.B.;  Greenwood,  B.E.;  Pilli,  A.  (1991).  ASTER:  An 
integration  of  the  AQUIRE  data  base  and  the  QSAR  system  for  use  in  ecological  risk 
assessments.  Sci.  Total  Environ.  109/110,  667-670. 

[41]  Roy,  T.A.,  Neil,  W.,  Yang,  J.J.,  Krueger,  A.J.,  Arroyo,  A.M.,  and  Mackerer,  C.R. 
(1998).  SAR  models  for  estimating  the  percutaneous  absorption  of  polynuclear  aromatic 
hydrocarbons.  SAR  QSAR  Environ.  Res.,  in  press. 

[42]  Hansch,  C.;  Yoshimoto,  M.  (1974).  Structure-activity  relationships  in 
immunochemistry.  2.  Inhibition  of  complement  by  benzamidines.  J.  Med.  Chem.  17, 
1160-1167. 

[43]  Hall,  L.H.,  Kier,  L.B.,  and  Phipps,  G.  (1984).  Structure-activity  relationship  studies 
on  the  toxicities  of  benzene  derivatives:  I.  An  additivity  model.  Environ.  Toxicol.  Chem. 
3,  355-365. 

[44]  Soderman,  J.V.  (Ed.).  (1982).  CRC  Handbook  of  Identified  Carcinogens  and 
Noncarcinogens:  Carcinogenicity-Mutagenicity  Database,  CRC  Press,  Inc.,  Boca  Raton, 
FL,  Volume  I,  p  655. 




Table  I.  Symbols,  definitions  and  classifications  of  topostructural,  topochemical, 
geometrical  and  quantum  chemical  descriptors. 

Topostructural 

I_D^W       Information index for the magnitudes of distances between all possible 
            pairs of vertices of a graph 
Ī_D^W       Mean information index for the magnitude of distance 
W           Wiener index = half-sum of the off-diagonal elements of the distance matrix 
            of a graph 
I^D         Degree complexity 
H^V         Graph vertex complexity 
H^D         Graph distance complexity 
ĪC          Information content of the distance matrix partitioned by frequency of 
            occurrences of distance h 
O           Order of neighborhood when IC_r reaches its maximum value for the 
            hydrogen-filled graph 
M1          A Zagreb group parameter = sum of square of degree over all vertices 
M2          A Zagreb group parameter = sum of cross-product of degrees over all 
            neighboring (connected) vertices 
^hχ         Path connectivity index of order h = 0-6 
^hχ_C       Cluster connectivity index of order h = 3-6 
^hχ_Ch      Chain connectivity index of order h = 3-6 
^hχ_PC      Path-cluster connectivity index of order h = 4-6 
P_h         Number of paths of length h = 0-10 
J           Balaban's J index based on distance 

Topochemical 

I_ORB       Information content or complexity of the hydrogen-suppressed graph at its 
            maximum neighborhood of vertices 
IC_r        Mean information content or complexity of a graph based on the r-th (r = 0-6) 
            order neighborhood of vertices in a hydrogen-filled graph 
SIC_r       Structural information content for r-th (r = 0-6) order neighborhood of vertices 
            in a hydrogen-filled graph 
CIC_r       Complementary information content for r-th (r = 0-6) order neighborhood of 
            vertices in a hydrogen-filled graph 
^hχ^b       Bond path connectivity index of order h = 0-6 
^hχ^b_C     Bond cluster connectivity index of order h = 3-6 
^hχ^b_Ch    Bond chain connectivity index of order h = 3-6 
^hχ^b_PC    Bond path-cluster connectivity index of order h = 4-6 
^hχ^v       Valence path connectivity index of order h = 0-6 
^hχ^v_C     Valence cluster connectivity index of order h = 3-6 
^hχ^v_Ch    Valence chain connectivity index of order h = 3-6 
^hχ^v_PC    Valence path-cluster connectivity index of order h = 4-6 
J^B         Balaban's J index based on bond types 
J^X         Balaban's J index based on relative electronegativities 
J^Y         Balaban's J index based on relative covalent radii 

Geometrical 

V_W         Van der Waals volume 
3DW         3-D Wiener number for the hydrogen-suppressed geometric distance matrix 
3DW_H       3-D Wiener number for the hydrogen-filled geometric distance matrix 

Quantum Chemical 

E_HOMO      Energy of the highest occupied molecular orbital 
E_HOMO1     Energy of the second highest occupied molecular orbital 
E_LUMO      Energy of the lowest unoccupied molecular orbital 
E_LUMO1     Energy of the second lowest unoccupied molecular orbital 
ΔH_f        Heat of formation 
μ           Dipole moment 
Table  II.  Classification  results  for  520  mutagens/non-mutagens  from  DFA. 

Model type         Indices included                              % Mutagens   % Non-mutagens 
                                                                 correct      correct 

topostructural     W, H^V, H^D, IC, M1, ^?χ, ^?χ_C,              76.2         57.3 
                   ^?χ_PC, P10 

topostructural     H^D, M1, ^?χ, P10, IC5,                       74.6         63.1 
+ topochemical     [six connectivity indices, symbols 
                   illegible] 

topostructural     H^D, M1, ^?χ, P10, IC5, ^?χ^v, ^?χ^v_Ch,      69.2         71.9 
+ topochemical     ^?χ^v_PC, J^B, nitroso(1), mustard(2), 
+ fragments        sulf(3), benz(4) 

topostructural     H^D, M1, P10, IC5, ^?χ^v, ^?χ^v_Ch,           71.5         71.9 
+ topochemical     ^?χ^v_PC, J^B, nitroso(1), mustard(2), 
+ fragments        sulf(3), benz(4), V_W 
+ geometrical 

(1) Nitroso- compounds. 
(2) Halogenated substituted mustard, sulfur mustard or oxygen mustard. 
(3) Organic sulfates or sulfonates. 
(4) Biphenyl amine, benzidine or 4,4'-methylenedianiline derivatives. 
?   Index order illegible in the source. 
Figure  Legend: 

Figure  1    Diagrammatic  representation  of  the  first  two  stages  in  hierarchical  QSAR 
             model  development  from  topological  indices. 

Figure  2    Scatterplot  of  experimental  normal  boiling  point  vs  estimated  normal 
             boiling  point  using  equation  3  for  1023  diverse  chemicals. 

Figure  3    Scatterplot  of  experimental  logP  vs  estimated  logP  using  equation  6  for 
             219  diverse  chemicals. 

Figure  4    Scatterplot  of  experimental  normal  vapor  pressure  vs  estimated  normal 
             vapor  pressure  using  equation  8  for  476  diverse  chemicals. 

Figure  5    Scatterplot  of  the  residual  vs  experimental  normal  vapor  pressure  from 
             equation  8  for  476  diverse  chemicals. 

Figure  6    Scatterplot  of  experimental  percent  dermal  penetration  vs  estimated 
             percent  dermal  penetration  using  equation  10  for  60  polycyclic  aromatic 
             hydrocarbons. 

Figure  7    Neutral  base  structure  for  the  107  benzamidines. 

Figure  8    Scatterplot  of  experimental  complement  inhibition  vs  estimated 
             complement  inhibition  using  equation  12  for  105  benzamidines. 

Figure  9    Scatterplot  of  experimental  acute  aquatic  toxicity  (LC50)  vs  estimated 
             acute  aquatic  toxicity  using  equation  16  for  69  benzene  derivatives. 
Appendix  1.5  Quantitative  comparison  of  five  molecular  structure  spaces  in  selecting  analogs... 
ABSTRACT 

Five  methods  for  characterizing  intermolecular  similarity  have  been  used  in  the  selection  of  analogs  for  a 
diverse  set  of  seventy-six  compounds.  These  methods  include  an  atom  pair  (AP)  based  similarity 
measure,  three  principal  component  spaces  derived  from  topostructural  indices,  topochemical  indices,  and  the 
combined  set  of  all  (topostructural  and  topochemical)  indices,  as  well  as  one  structure  space  consisting 
of  principal  components  calculated  from  physicochemical  properties.  Each  method  has  been  used  in  the 
selection  of  sets  of  analogs,  ranging  from  five  to  forty  in  number  in  increments  of  five,  for  each  of  the 
seventy-six  compounds.  The  degree  of  overlap  of  the  sets  of  analogs  selected  by  the  five  separate 
methods  was  analyzed. 


KEYWORDS 

molecular  graph,  atom  pairs,  principal  components,  analog  selection,  molecular  similarity 


INTRODUCTION 

Molecular  similarity  is  an  intuitive  concept  which  is  subjectively  understood  by  the  chemist.  In  the  realm 
of  mathematical  and  computational  chemistry,  intermolecular  similarity  can  be  objectively  quantified  in 
terms  of  descriptors  derived  from  the  molecular  structure  (Basak  et  al,  1988b;  Basak  et  al,  1997;  Carbó  et 
al,  1980;  Fisanick  et  al,  1992;  Fisanick  et  al,  1994;  Johnson  et  al,  1988;  Maggiora  and  Johnson,  1990; 
Randic,  1992;  Willet  and  Winterman,  1986).  Chemical  structures  can  be  represented  by  various  types  of 
models,  e.g.,  simple  molecular  graphs,  multigraphs,  pseudographs,  3-D  models,  and  quantum  chemical 
hamiltonian  functions.  Similarity,  being  context  specific,  is  quantified  in  terms  of  a  user-defined  set  of 
parameters  or  properties  of  molecules.  Consequently,  there  are  a  potentially  endless  number  of  methods 
that  one  can  define  to  quantify  intermolecular  similarity. 

In  recent  years  molecular  similarity  methods  based  on  topological  and  substructural  descriptors  have 
become  popular.  Such  methods  are  based  on  different  types  of  graph  invariants  such  as  topological 
indices,  atom  pairs,  and  fragments  (Basak  and  Grunwald,  1994,  1995c;  Basak  and  Gute,  1997;  Basak  et 
al,  1988b;  Carbó  et  al,  1980;  Carhart  et  al,  1985;  Fisanick  et  al,  1992;  Johnson  et  al,  1988;  Randic,  1992; 
Willet  and  Winterman,  1986).  Similarity/dissimilarity  methods  have  been  used  in  the  clustering  of  large 
sets  of  chemicals  (Lajiness,  1990),  the  selection  of  analogs  for  toxicological  risk  assessment  (Basak  and 
Grunwald,  1994;  Basak  et  al,  1995),  and  the  estimation  of  the  physicochemical  and  biomedicinal 
properties  of  chemicals  (Basak  and  Grunwald,  1995a,  1995c;  Basak  et  al,  1996a;  Basak  and  Gute,  1997). 
Usually  some  number,  n,  of  descriptors  is  used  to  define  the  structure  space  of  chemicals  and  either 
Euclidean  distance  in  the  n-dimensional  space  or  some  association  coefficient  is  used  to  quantify 
intermolecular  similarity.  The  basic  paradigm  underlying  molecular  similarity  analysis  is  "similar 
structures  have  similar  properties."  However,  it  has  been  shown  that  different  molecular  similarity 
methods  select  quite  different  sets  of  analogs  from  a  specific  database  for  the  same  set  of  query 
chemicals  (Basak  and  Grunwald,  1995c).  In  the  case  of  the  automated  selection  of  analogs  for  testing 
chemicals  in  drug  design  protocols  or  toxicological  hazard  assessment,  one  would  like  to  select  analogs 
by  reasonably  non-redundant  molecular  similarity  methods.  Therefore,  it  is  of  interest  to  investigate  the 
degree  to  which  various  similarity  methods  differ  from  each  other.  In  a  previous  study  we  analyzed  the 
analog  selection  profiles  for  topologically-based  vis-a-vis  empirical  property-based  molecular  similarity 
techniques  in  the  selection  of  nearest  neighbors  of  molecules  (Basak  and  Grunwald,  1995c).  In  this  paper 
we  have  compared  the  analog  selection  profiles  of  five  different  molecular  similarity  methods,  four  of 
which  are  based  on  graph  invariants  and  one  of  which  is  derived  from  physicochemical  property  data. 


DATABASE  AND  PARAMETERS 

Development  of  the  database 

The  data  used  in  this  study  are  a  subset  of  the  U.S.  EPA  ASTER  system  (Russom,  1992).  Compounds  were 
included  if  they  have  experimental  values  within  the  ASTER  database  for: 

1.  Log  Kow  Logarithm  of  the  octanol/water  partition  coefficient  (hydrophobicity). 

2.  BP  Boiling  point  at  760  Torr. 

3.  MP  Melting  point. 

Kamlet  (1987)  provided  the  remaining  physicochemical  properties  used  in 
this  study.  These  four  solvatochromic  parameters  are: 

1.  V/100  The  molar  volume  of  a  molecule  calculated  as  its  molecular  weight  divided  by 
the  liquid  density  at  20°  C. 

2.  α  A  measure  of  the  hydrogen  bond  donor  acidity  of  a  compound  in  forming  a 
hydrogen  bond. 

3.  β  A  scale  of  the  hydrogen  bond  acceptor  basicity  of  a  compound  in  forming  a 
hydrogen  bond. 

4.  π*  A  measure  of  solute  or  solvent  dipolarity  or  polarizability  that  quantifies  the 
ability  of  a  compound  to  stabilize  a  neighboring  charge  or  dipole  by  virtue  of  its 
dielectric  effect. 

Kamlet  et  al  (1988)  describe  in  detail  the  methods  used  in  the  determination  of  these  solvatochromic 
parameters. 

Calculation  of  Atom  Pairs 

Atom  pairs  (APs)  were  calculated  using  the  method  of  Carhart  et  al  (1985).  An  atom  pair  is  defined  as  a 
substructure  which  consists  of  two  non-hydrogen  atoms  i  and  j  and  their  interatomic  separation: 

<descriptor_i>-<separation>-<descriptor_j> 

where  <descriptor>  contains  information  about  the  element  type,  number  of  non-hydrogen  neighbors, 
and  the  number  of  π  electrons  for  each  atom.  The  interatomic  separation  of  two  atoms  is  the  number  of 
atoms  traversed  in  the  shortest  bond-by-bond  path  containing  both  atoms.  These  calculations  were 
conducted  using  the  APProbe  software  developed  by  Basak  and  Grunwald  (1993). 
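The atom pair definition above can be sketched in a few lines. This is an illustrative enumeration only, not the APProbe software: the descriptor string format `C(X2,pi0)`, the function names, and the assumption of a connected hydrogen-suppressed graph are all choices made here for the example.

```python
from collections import deque
from itertools import combinations


def atoms_on_shortest_paths(adj, start):
    """Breadth-first search; counts atoms on the shortest path, so bonded atoms are 2 apart."""
    count = {start: 1}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in count:
                count[v] = count[u] + 1
                queue.append(v)
    return count


def atom_pairs(elements, pi_electrons, bonds):
    """Count atom pairs for a connected hydrogen-suppressed molecular graph.

    elements      element symbol of each non-hydrogen atom
    pi_electrons  number of pi electrons assigned to each atom
    bonds         list of (i, j) index pairs for bonded atoms
    """
    n = len(elements)
    adj = {i: [] for i in range(n)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    # Hypothetical descriptor format: element, non-hydrogen neighbors, pi electrons.
    desc = [f"{elements[i]}(X{len(adj[i])},pi{pi_electrons[i]})" for i in range(n)]
    separation = [atoms_on_shortest_paths(adj, i) for i in range(n)]
    counts = {}
    for i, j in combinations(range(n), 2):
        first, second = sorted((desc[i], desc[j]))  # order-independent key
        key = f"{first}-{separation[i][j]}-{second}"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Ethanol heavy-atom skeleton C-C-O, no pi electrons.
ethanol = atom_pairs(["C", "C", "O"], [0, 0, 0], [(0, 1), (1, 2)])
for key, value in sorted(ethanol.items()):
    print(key, value)
```

A molecule with n non-hydrogen atoms yields n(n-1)/2 atom pairs in total, which is three for the ethanol skeleton used here.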

Calculation  of  Topological  Indices 

The  topological  indices  used  in  this  study  have  been  calculated  using  the  program  POLLY  2.3  (Basak  et 
al,  1988a)  and  software  developed  by  the  authors  to  calculate  Balaban's  J  indices.  A  complete  listing  of 
these  indices,  along  with  examples  of  their  calculation,  has  been  given  in  detail  previously  (Basak  and 
Gute,  1997;  Basak  et  al,  1997). 

The  topological  indices  were  further  divided  into  two  subsets,  topostructural  and  topochemical  indices. 
Topostructural  indices  are  topological  indices  which  only  encode  information  about  the  adjacency  and 
distances  of  the  vertices  (atoms)  within  a  graph  (molecular  structure),  irrespective  of  the  chemical  nature 
of  the  atoms  involved.  The  topochemical  indices  are  parameters  which  quantify  information  regarding 
the  topology  of  the  graph  (molecule),  as  well  as  specific  chemical  properties  of  the  atoms  and  bonds 
comprising  the  molecule.  These  indices  are  derived  from  weighted  graphs  where  each  vertex  (atom)  or 
edge  (bond)  is  properly  weighted  with  selected  chemical  information.  The  division  of  the  topological 
indices  into  these  distinct  sets  has  been  discussed  in  previous  studies  (Basak  et  al,  1996b,  1997). 
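As a small worked illustration of topostructural indices, the sketch below computes three of them from a hydrogen-suppressed graph using the usual definitions: the Wiener index W as the half-sum of the off-diagonal distance matrix entries, and the Zagreb parameters M1 and M2 from vertex degrees. The function names are chosen here for the example; this is not the POLLY program.

```python
from collections import deque


def build_graph(n, bonds):
    """Adjacency lists for a graph with n vertices and the given edges."""
    adj = [[] for _ in range(n)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def distance_matrix(adj):
    """Topological distances (number of bonds) between all vertex pairs, by BFS."""
    n = len(adj)
    dist = []
    for start in range(n):
        row = [-1] * n
        row[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if row[v] < 0:
                    row[v] = row[u] + 1
                    queue.append(v)
        dist.append(row)
    return dist


def wiener_index(adj):
    """Half-sum of the off-diagonal elements of the distance matrix."""
    return sum(sum(row) for row in distance_matrix(adj)) // 2


def zagreb_m1(adj):
    """Sum of squared vertex degrees."""
    return sum(len(nb) ** 2 for nb in adj)


def zagreb_m2(adj, bonds):
    """Sum over edges of the product of the degrees of the two end vertices."""
    return sum(len(adj[a]) * len(adj[b]) for a, b in bonds)


n_butane_bonds = [(0, 1), (1, 2), (2, 3)]   # chain of four heavy atoms
isobutane_bonds = [(0, 1), (1, 2), (1, 3)]  # one central atom, three neighbors

n_butane = build_graph(4, n_butane_bonds)
isobutane = build_graph(4, isobutane_bonds)

print(wiener_index(n_butane), zagreb_m1(n_butane), zagreb_m2(n_butane, n_butane_bonds))
print(wiener_index(isobutane), zagreb_m1(isobutane), zagreb_m2(isobutane, isobutane_bonds))
```

Because only adjacency and distance enter the calculation, 1-propanol (C-C-C-O) gives the same values as n-butane here. Separating such pairs requires the topochemical indices, which weight vertices or edges with chemical information.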

Similarity  Measures 

Two  measures  of  intermolecular  similarity  were  used  in  this  study.  The  methods  have  been  described  in 
detail  previously  (Basak  and  Grunwald,  1995b)  and  include  an  associative  measure  using  atom  pairs 
(AP)  and  Euclidean  distance  (ED)  within  an  n-dimensional  principal  component  (PC)  space.  The 
Euclidean  distance  method  was  used  in  conjunction  with  the  topological  indices  and  the  physicochemical 
property  data. 
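The two kinds of measure can be sketched as follows. The exact formula is not given in this paper, so the associative measure below is an assumption: it uses the atom pair similarity of Carhart et al (1985), S = 2C / (T_i + T_j), where C counts the atom pairs shared by two molecules and T_i and T_j are their total atom pair counts. The function names and the example counts are hypothetical.

```python
import math


def atom_pair_similarity(pairs_a, pairs_b):
    """Associative similarity in [0, 1] from two atom-pair count dictionaries.

    Assumed form: S = 2C / (Ta + Tb), with C the sum over atom pair types of the
    smaller of the two counts, and Ta, Tb the total number of atom pairs.
    """
    common = sum(min(count, pairs_b.get(key, 0)) for key, count in pairs_a.items())
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    return 2.0 * common / total if total else 0.0


def euclidean_distance(point_a, point_b):
    """Distance between two compounds in an n-dimensional principal component space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))


# Hypothetical atom-pair counts for two molecules.
mol_a = {"x": 2, "y": 1}
mol_b = {"x": 1, "z": 2}
s_ab = atom_pair_similarity(mol_a, mol_b)
s_aa = atom_pair_similarity(mol_a, mol_a)

# Hypothetical principal component scores for two compounds.
d_pc = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(round(s_ab, 3), s_aa, d_pc)
```

The two measures run in opposite directions: a larger associative value means more similar molecules, whereas a larger Euclidean distance means less similar ones.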


ANALOG  SELECTION 

Following  the  quantification  of  intermolecular  similarity  for  the  five  similarity  spaces,  the  K-nearest 
neighbors  or  analogs  (K  =  5,  10,  15,  20,  25,  30,  35,  40)  were  determined  on  the  basis  of  the  associative 
measure  used  in  conjunction  with  the  AP  method  or  based  on  ED  within  a  principal  component  space. 
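Selecting the K nearest neighbors in a principal component space amounts to ranking all other compounds by distance from the probe and keeping the first K. The sketch below illustrates this; the compound labels, coordinates, and the function name `k_nearest` are hypothetical, and ties are broken alphabetically by label as an arbitrary choice.

```python
import math


def k_nearest(probe, scores, k):
    """Return the labels of the k compounds closest to `probe` by Euclidean distance.

    scores maps a compound label to its tuple of principal component scores.
    """
    origin = scores[probe]
    ranked = sorted(
        (math.dist(origin, point), label)
        for label, point in scores.items()
        if label != probe
    )
    return [label for _, label in ranked[:k]]


# Hypothetical two-dimensional PC scores for five compounds.
pc_scores = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 2.0),
    "D": (5.0, 5.0),
    "E": (-1.0, -1.0),
}

analogs_k2 = k_nearest("A", pc_scores, 2)
analogs_k4 = k_nearest("A", pc_scores, 4)
print(analogs_k2, analogs_k4)
```

For the atom pair method the same ranking applies with the sort order reversed, since a higher associative value indicates a closer analog.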


RESULTS  AND  DISCUSSION 

In  generating  the  principal  components  for  the  sets  of  topological  indices,  only  the  principal  components 
with  eigenvalues  greater  than  1.0  were  retained.  This  left  six  PCs  for  the  set  of  topostructural  indices 
which  cumulatively  explained  94.1%  of  the  variance  in  the  indices,  eight  PCs  for  the  set  of  topochemical 
indices  which  explained  93.5%  of  the  variance  in  these  indices,  and  ten  PCs  for  the  set  of  all  topological 
indices  which  cumulatively  explained  95.2%  of  the  variance  in  the  topological  indices.  These  formed  the 
final  sets  of  PCs  which  were  used  in  creation  of  the  similarity  spaces  and  selection  of  analogs  for  these 
three  methods. 
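The retention rule can be illustrated with a minimal sketch. It assumes the principal components were extracted from standardized indices, so that the eigenvalues sum to the number of indices; the eigenvalues below are invented and are not those of the study.

```python
def retain_components(eigenvalues, threshold=1.0):
    """Keep components whose eigenvalue exceeds the threshold.

    Returns the number of retained components and the fraction of the total
    variance they explain together.
    """
    ordered = sorted(eigenvalues, reverse=True)
    kept = [value for value in ordered if value > threshold]
    explained = sum(kept) / sum(ordered)
    return len(kept), explained


# Hypothetical eigenvalues for six standardized indices (they sum to 6).
example_eigenvalues = [3.0, 1.5, 0.9, 0.4, 0.15, 0.05]
n_kept, fraction = retain_components(example_eigenvalues)
print(n_kept, round(100 * fraction, 1))
```

With standardized variables an eigenvalue of 1.0 corresponds to the variance of a single original index, so the cutoff discards components that carry less information than one index alone.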

Each  similarity  method  was  used  to  select  sets  of  analogs  for  each  of  the  seventy-six  compounds  in  the 
dataset.  The  analogs  selected  by  each  method  were  compared  with  the  analogs  selected  by  every  other  method 
to  examine  the  overlap  between  the  sets  of  analogs.  The  results  of  this  comparison  are  presented  in  Table 
1  below  as  the  arithmetic  mean  of  the  cardinalities  of  the  intersection  of  subsets  of  analogs  chosen  by  a 
particular  pair  of  similarity  methods  for  a  specific  value  of  K.  For  example,  the  topostructural  and 
topochemical  similarity  methods  selected  an  average  of  2.2  identical  analogs  out  of  five  for  the  entire  set 
of  seventy-six  chemicals.  Thus,  slightly  under  half  of  the  analogs  selected  by  the  two  methods  were 
identical. 
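The overlap statistic described above is the mean size of the intersection of two analog sets, taken over all probe compounds. A minimal sketch follows; the probe labels and analog identifiers are invented for illustration and `mean_overlap` is a name chosen here.

```python
def mean_overlap(analogs_a, analogs_b):
    """Arithmetic mean of the number of analogs two methods share per probe compound.

    Each argument maps a probe compound to the analogs one method selected for it.
    """
    shared = [len(set(analogs_a[probe]) & set(analogs_b[probe])) for probe in analogs_a]
    return sum(shared) / len(shared)


# Hypothetical K = 5 selections by two methods for three probe compounds.
method_1 = {
    "p1": [1, 2, 3, 4, 5],
    "p2": [6, 7, 8, 9, 10],
    "p3": [11, 12, 13, 14, 15],
}
method_2 = {
    "p1": [1, 2, 20, 21, 22],     # shares 2 analogs with method 1
    "p2": [6, 7, 8, 23, 24],      # shares 3
    "p3": [11, 25, 26, 27, 28],   # shares 1
}

overlap = mean_overlap(method_1, method_2)
print(overlap, overlap / 5)
```

Dividing the mean by K gives the shared fraction; for the reported value of 2.2 at K = 5 this is 0.44.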

It  is  clear  from  the  data  in  Table  1  that  the  five  molecular  similarity  methods  studied  in  this  paper  are  not 
radically  different  from  one  another  because  they  have  a  substantial  degree  of  overlap  in  the  profile  of 
selected  neighbors.  This  is  an  interesting  observation  in  view  of  the  fact  that  the  structure  spaces  are 
constructed  from  such  diverse,  independent  variables  as  experimentally  determined  physicochemical 
properties  and  calculated  graph  invariants. 

A  perusal  of  the  data  also  shows  that  the  property-based  similarity  method  is  distinct  from  the  group  of 
methods  based  on  topological  indices  and  atom  pairs.  For  K  =  20,  for  example,  the  average  numbers  of 
common  neighbors  for  the  property-based  method  vis-a-vis  the  topostructural,  topochemical,  all  index 
and  atom  pair-based  methods  are  8.7,  8.9,  8.6  and  8.9,  respectively.  For  the  same  value  of  K,  the  numbers 
of  common  analogs  for  the  topostructural  method  with  atom  pair,  topochemical  and  all  index  methods 
are  12.3,  12.2  and  13.1,  respectively. 

Table  1.  Comparisons  of  the  overlap  in  analog  selection  for  five  distinct  similarity  methods. 

K     S vs C   S vs T   C vs T   S vs P   C vs P   T vs P   S vs A   C vs A   T vs A   P vs A 

5      2.2      2.5      3.5      1.2      1.6      1.6      2.2      2.1      2.3      1.9 
10     5.0      5.4      7.1      3.1      3.4      3.5      4.8      4.7      5.0      4.1 
15     8.6      9.2     11.3      5.6      5.7      5.7      8.2      7.8      8.1      6.3 
20    12.2     13.1     15.1      8.7      8.9      8.6     12.3     10.7     11.0      8.9 
25    15.7     16.7     19.5     12.1     12.3     11.9     16.3     14.3     14.3     12.1 
30    20.0     20.9     23.8     16.0     16.6     15.8     19.5     17.4     17.4     15.7 
35    24.7     25.6     28.9     20.5     21.1     20.0     22.9     21.4     21.1     20.4 
40    30.4     30.9     33.9     25.1     25.9     25.0     26.6     25.9     25.5     24.6 

S  =  topostructural  indices          P  =  physicochemical  parameters 
C  =  topochemical  indices            A  =  atom  pairs 
T  =  all  topological  indices 



For  the  three  similarity  methods  calculated  from  the  topological  indices,  the  topochemical  indices  seem 
to  have  more  influence  on  the  selection  of  neighbors  when  they  are  used  along  with  topostructural 
parameters  as  independent  variables.  This  is  clear  from  the  fact  that  for  every  value  of  K  in  Table  1  the 
topochemical  and  all  index  methods  have  a  higher  degree  of  overlap  than  that 
between  the  topostructural  and  all  index  methods. 

In  conclusion,  if  one  is  interested  in  selecting  only  two  candidates  from  the  set  of  five  methods  studied 
here  for  analog  selection,  the  property-based  method  and  any  one  of  the  theoretically-based  methods 
would  be  the  choice.  There  are  no  criteria  to  decide  which  of  the  four  topologically-based  methods  should 
be  selected  for  a  particular  occasion.  Further  studies  of  the  analog  selection  and  property  prediction 
profiles  of  these  methods  are  necessary  to  guide  the  selection  of  a  specific  method  for  a  particular 
practical  situation. 
We  calculated  202  molecular  descriptors  (topological  indices,  TIs)  for  two  chemical 
databases  (a  set  of  139  hydrocarbons  and  another  set  of  1037  diverse  chemicals).  Variable  cluster 
analysis  of  these  TIs  grouped  these  structures  into  14  clusters  for  the  first  set  and  into  18  clusters 
for  the  second  set.  Correspondences  between  the  same  TIs  in  the  two  sets  reveal  how  and  why  the 
various  classes  of  TIs  are  mutually  related  and  provide  insight  into  what  aspects  of  chemical 
structure  they  are  expressing. 


1.  INTRODUCTION 
A  major  part  of  the  current  research  in 
mathematical  chemistry,  chemical  graph 
theory,  and  quantitative  structure- 
activity/property  relationship  studies 
involves  topological  indices.  Topological 
indices  (TIs)  are  numerical  graph  invariants 
that  quantitatively  characterize  molecular 
structure.  A  graph  G  =  (V,  E)  is  an  ordered 
pair  of  two  sets  V  and  E,  the  former 
representing  a  nonempty  set  and  the  latter 
representing  unordered  pairs  of  elements  of 
the  set  V.  When  V  represents  the  atoms  of  a 
molecule  and  elements  of  E  symbolize 
covalent  bonds  between  pairs  of  atoms,  then 
G  becomes  a  molecular  graph  (or 
constitutional  graph,  because  there  is  no 
stereochemical  information).  Such  a  graph 
depicts  the  topology  of  the  chemical  species. 
A  graph  is  characterized  using  graph 
invariants.  An  invariant  may  be  a 
polynomial,  a  sequence  of  numbers,  or  a 
single  number.  A  numerical  graph  invariant 
(i.e., a single number) which characterizes
the  molecular  structure  is  called  a 
topological  index. 


2.  OVERVIEW  OF  TOPOLOGICAL 
INDICES  USED  IN  THE  PRESENT 
STUDY 

A large number of topological indices
have been defined and used. The majority
of TIs are derived from the various matrices
corresponding to molecular graphs. The
adjacency matrix A(G) and the distance
matrix D(G) of the molecular graph G have
been most widely used in the formulation of
TIs. Integer-number local vertex invariants
(LOVIs) are the vertex degrees ($v_i$) and the
distance sums (distasums, $d_i$) resulting from
summation over rows or columns of entries
in the adjacency and distance matrices,
respectively. By mathematical operations
performed on such LOVIs, one can obtain a
molecular descriptor, i.e., a topological
index. Wiener's index W (eq. 1), the
Zagreb group index $M_1$ (eq. 2), Randić's
connectivity index $\chi$ (eq. 3), the higher-order
connectivity indices $^n\chi$ for paths of
length n defined by Kier and Hall, and the
J index (eq. 4) fall in this category.

$W = \frac{1}{2}\sum_i d_i$  (1)

$M_1 = \sum_i v_i^2$  (2)

$\chi = \sum_{ij} (v_i v_j)^{-1/2}$  (3)

$J = [q/(\mu + 1)] \sum_{ij} (d_i d_j)^{-1/2}$  (4)

The summations in formulas (3) and (4)
are over all edges i-j in the hydrogen-depleted
graph. The number q of graph
edges and the number $\mu$ of cycles in the graph are
introduced into formula (4) in order to avoid
the automatic increase of J with graph size
and cyclicity. Indeed, for an infinite linear
carbon chain it was demonstrated that J = $\pi$
= 3.14159. The nature of atoms can be taken
into  account  by  means  of  parameters  based 
on  their  relative  atomic  numbers, 
electronegativities,  or  covalent  radii,  with 
respect  to  those  of  carbon  atoms, 
multiplying  the  corresponding  distasum  in 
the  formula  (4)  for  J. 
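
Equations (1) through (4) can be illustrated with a minimal sketch. It assumes a connected hydrogen-depleted graph stored as an adjacency list (a list of neighbor lists); the function names `distance_matrix` and `basic_indices` are illustrative and do not come from POLLY or from the authors' software. The cyclomatic number is taken as $\mu = q - n + 1$, which holds for a connected graph with n vertices.

```python
from collections import deque
from math import sqrt


def distance_matrix(adj):
    """All-pairs topological distances by breadth-first search from each vertex."""
    n = len(adj)
    dist = []
    for start in range(n):
        d = [-1] * n
        d[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if d[w] < 0:
                    d[w] = d[u] + 1
                    queue.append(w)
        dist.append(d)
    return dist


def basic_indices(adj):
    """Return W (eq. 1), M1 (eq. 2), chi (eq. 3) and J (eq. 4) for a connected graph."""
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in adj[i] if i < j]
    degree = [len(neighbors) for neighbors in adj]        # vertex degrees v_i
    distsum = [sum(row) for row in distance_matrix(adj)]  # distance sums d_i
    q = len(edges)
    mu = q - n + 1  # number of cycles (assumption: the graph is connected)
    return {
        "W": sum(distsum) / 2,
        "M1": sum(v * v for v in degree),
        "chi": sum(1 / sqrt(degree[i] * degree[j]) for i, j in edges),
        "J": q / (mu + 1) * sum(1 / sqrt(distsum[i] * distsum[j]) for i, j in edges),
    }


# n-Butane as a four-vertex path: C0-C1-C2-C3
n_butane = [[1], [0, 2], [1, 3], [2]]
print(basic_indices(n_butane))
```

For the n-butane path the distance sums are 6, 4, 4, 6, so W = 10, and the three edges give J = 3(2/√24 + 1/4), roughly 1.975.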

The  mean  square  root  distance  D  derived 
from  all  topological  distances  (denoted  by  i 
in  the  next  formula)  is  defined  as: 

$D = \left[\left(\sum_i i^2\right) / \left(\sum_i i\right)\right]^{1/2}$  (5)

For taking into account the chemical
nature of atoms symbolized by vertices, Kier
and Hall advocated the use of “valence
connectivity indices”. These are
calculated with formulas similar to Randić's
(eq. 3), but the products of edge endpoint (or
path vertex) invariants are no longer products of
vertex degrees but of weights (valence delta
values $\delta_i^v$) given by formula (6):

$\delta_i^v = (Z_i^v - H_i)/(Z_i - Z_i^v - 1)$  (6)

where $Z_i^v$ stands for the number of valence
electrons in atom i, $Z_i$ is its atomic number,
and $H_i$ is the number of hydrogen atoms
attached to atom i.
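
As a small worked illustration of formula (6), the sketch below evaluates the valence delta for a few common atom types and combines them into a first-order valence connectivity index in the manner of eq. (3). The function names and the ethanol example are chosen here for illustration only and are not taken from the original study.

```python
from math import sqrt


def valence_delta(z, z_valence, n_hydrogens):
    """Valence delta of formula (6): (Z^v - H) / (Z - Z^v - 1)."""
    return (z_valence - n_hydrogens) / (z - z_valence - 1)


def first_order_valence_chi(deltas, edges):
    """Sum over edges of (delta_i * delta_j)^(-1/2), analogous to eq. (3)."""
    return sum(1 / sqrt(deltas[i] * deltas[j]) for i, j in edges)


# Ethanol, hydrogen-depleted: CH3 (atom 0) - CH2 (atom 1) - OH (atom 2)
ethanol_deltas = [
    valence_delta(6, 4, 3),  # methyl carbon
    valence_delta(6, 4, 2),  # methylene carbon
    valence_delta(8, 6, 1),  # hydroxyl oxygen
]
ethanol_chi_v = first_order_valence_chi(ethanol_deltas, [(0, 1), (1, 2)])
print(ethanol_deltas, round(ethanol_chi_v, 4))
```

For second-row atoms the denominator equals 1, so the valence delta reduces to $Z_i^v - H_i$; the denominator only matters for heavier atoms such as chlorine, where the value is 7/9.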

The most recent additions to the Kier-Hall
armamentarium of TIs are
electrotopological state indices.

Another class of molecular descriptors,
the information-theoretic indices, is
derived from an entirely different line of reasoning.
In this case, the set of structural features of the
molecule is partitioned into disjoint subsets using an
equivalence relation; a molecular
complexity index is then computed using
Shannon's idea of information content or
complexity. Real-number local vertex
invariants (LOVIs), on the other hand, may
also be defined starting from
matrices other than A(G) or D(G), or by
applying information theory at the vertex
level. Thus, topological indices U, V, X, and
Y were defined. Bonchev and Trinajstić
described several information-theoretic TIs,
which are reviewed thoroughly in Bonchev's book.
The information-theoretic indices developed
by Basak and coworkers take into account
all atoms in the constitutional formula
(hydrogens also being included), and one
considers the information content provided
by various classes of atoms based on their
topological neighborhood. There are three
main types of informational indices
developed by Basak et al.: IC (mean
information content or complexity of a
hydrogen-filled graph, with vertices grouped
into equivalence classes having r vertices;
the equivalence is based on the nature of
atoms and bonds, in successive
neighborhood groups); CIC (complementary
information content); and SIC (structural
information content). They are not
intercorrelated with other TIs.

In the following formulas, the summation spans the
range from i = 1 to i = r:

$IC_r = -\sum_i p_i \log_2 p_i$  (10)

$SIC_r = IC_r / \log_2 N$  (11)

$CIC_r = \log_2 N - IC_r$  (12)

The probability that a randomly selected
vertex occurs in the i-th equivalence class is
denoted by $p_i$, and N is the total number of
vertices of the graph. The $IC_r$, $SIC_r$ and $CIC_r$
indices can be calculated for different orders
of neighborhoods, r (r = 0, 1, 2, ..., $\rho$),
where $\rho$ is the radius of the molecular
graph G. At the 0th-order level, the atom set
is partitioned based solely on their chemical
nature; at the level of the first-order
topological neighborhood, the atoms are
partitioned into disjoint subsets based on
their chemical nature and their first-order
bonding topology. At the next level, the
atom set is decomposed into equivalence
classes using their chemical nature and
bonding pattern up to the second-order
bonded neighbors. The process is continued
until consideration of higher-order neighbors
does not yield further increase in the number
or composition of disjoint subsets.
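
Formulas (10) through (12) can be sketched as follows. The sketch assumes the partition of the hydrogen-filled graph into equivalence classes has already been obtained and is supplied simply as a list of class sizes; building the partition from neighborhoods of order r is not shown. The function name `information_indices` is illustrative.

```python
from math import log2


def information_indices(class_sizes):
    """Return (IC, SIC, CIC) for a partition of N vertices into classes.

    class_sizes holds the number of vertices in each equivalence class.
    Assumes N > 1, since SIC divides by log2(N).
    """
    n = sum(class_sizes)
    probabilities = [size / n for size in class_sizes]
    ic = -sum(p * log2(p) for p in probabilities)  # eq. (10)
    sic = ic / log2(n)                             # eq. (11)
    cic = log2(n) - ic                             # eq. (12)
    return ic, sic, cic


# Methane at order 0: atoms partitioned by element only, {C} and {H, H, H, H}
ic0, sic0, cic0 = information_indices([1, 4])
print(round(ic0, 4), round(sic0, 4), round(cic0, 4))
```

For methane the two classes have probabilities 1/5 and 4/5, which gives an IC of about 0.722 bits and a CIC of 1.6 bits.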

A large variety of real-number local
vertex invariants, and thence a larger variety
of TIs, were described based on converting a
matrix (A or D for instance) into a system of
linear equations. This is done by means of
two column vectors that can convey
topological, chemical, or numerical
information. One non-zero vector is the free
term of the system of equations. The other
one (which may be zero, but this restricts the
choices on available supplementary
information) becomes the main diagonal of
the matrix (if both vectors were zero,
then some negative LOVIs would result, with
difficulties of interpretation). These vectors
may be the following integers: Z (atomic
number of the atom corresponding to each
vertex), V (vertex degree), I (identity), N
(number of non-hydrogen atoms, or order of
the graph), $N^k$ (power k of N). Less
frequently, to reflect the periodicity of
chemical properties, one may use real numbers: S
(electronegativity) or R (covalent radius) of
the atom corresponding to each vertex. The
resulting matrix with the vector for the main
diagonal constitutes the set of coefficients
for the N unknowns which represent the
real-number LOVIs of the N vertices. The
triplet (matrix, vector for the main diagonal
and vector for the free term) also serves as
notation for LOVIs and for the derived TIs.
After solving the system of N linear
equations, the LOVIs ($x_i$) are assembled into
a “triplet TI” based on one of the following
operations:

1. Summation, $\sum_i x_i$;

2. Summation of squares, $\sum_i x_i^2$;

3. Summation of square roots, $\sum_i x_i^{1/2}$;

4. Sum of inverse square root of cross-product
over edges ij, $\sum_{ij} (x_i x_j)^{-1/2}$;

5. Product, $N(\prod_i x_i)^{1/N}$.

The number (1 through 5) of the above operation,
written after the triplet, completes the notation of the
triplet TIs.
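
The triplet construction can be sketched as follows, under stated assumptions: the matrix is the adjacency matrix of the hydrogen-depleted graph, the main-diagonal vector replaces its zero diagonal, and the system is solved by plain Gaussian elimination. The helper names (`solve`, `triplet_lovis`, `triplet_indices`) are illustrative and do not reproduce the authors' in-house software; operations 3 through 5 additionally assume that all LOVIs are positive.

```python
from math import prod, sqrt


def solve(coeffs, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(coeffs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def triplet_lovis(matrix, main_diagonal, free_term):
    """LOVIs x_i from the triplet (matrix, main-diagonal vector, free-term vector)."""
    n = len(matrix)
    coeffs = [[main_diagonal[i] if i == j else matrix[i][j] for j in range(n)]
              for i in range(n)]
    return solve(coeffs, free_term)


def triplet_indices(lovis, edges):
    """The five assembling operations, keyed by operation number."""
    n = len(lovis)
    return {
        1: sum(lovis),
        2: sum(x * x for x in lovis),
        3: sum(sqrt(x) for x in lovis),
        4: sum(1 / sqrt(lovis[i] * lovis[j]) for i, j in edges),
        5: n * prod(lovis) ** (1 / n),
    }


# n-Butane, triplet (A, Z, V): adjacency matrix, atomic numbers, vertex degrees
adjacency = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
lovis = triplet_lovis(adjacency, [6, 6, 6, 6], [1, 2, 2, 1])
azv = triplet_indices(lovis, [(0, 1), (1, 2), (2, 3)])
print(lovis, azv[1])
```

For this example the system can be solved by hand: the terminal carbons receive 5/41 and the inner carbons 11/41, so operation 1 gives 32/41.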

To conclude this brief review of TIs, one
should mention recent progress that includes
other matrices such as the reciprocal
distance matrix, which yields Harary
indices, the regressive distance matrices,
the Szeged matrix, and the resistance
distance matrix, which affords Kirchhoff
indices. So-called optimal structural
descriptors can be obtained from some TIs
by varying some parameters and thereby
adapting them to the data base;
alternatively, in Randić-type formulas (eqs.
3, 4) the exponent is allowed to differ
from -1/2. Three-dimensional molecular
descriptors can be derived from geometrical
and topological structural features of
molecules.

Each of the indices discussed above is a
"global" parameter, i.e., it quantifies certain
aspects  of  the  entire  molecular  structure 
using  a  single  number. 

It  is  clear  from  the  above  discussion  that 
the  set  of  TIs  is  a  group  of  heterogeneous 
entities.  They  have  been  defined  to 
characterize  molecular  structure  based  on 
distinct  objectives  and  motivations.  In  spite 
of  their  distinctive  characteristics,  TIs  share 
certain  common  features.  A  topological 
index  maps  a  set  of  chemicals  C  into  the  set 
R  of  real  or  integer  numbers.  Therefore,  TIs 
quantify  some  general  aspects  of  molecular 
architecture  like  size,  shape,  symmetry, 
bonding  type,  cyclicity,  branching  pattern, 
etc. 

Topological  indices  have  been  used  for 
isomer  discrimination,  quantification  of  the 
structural similarity/dissimilarity of
molecules, and prediction of property/activity
from structure. The widespread
use of TIs obviously encourages one to ask
some fundamental questions about them:
What  is  the  fundamental  nature  of  TIs?  To 
what  degree  are  they  intercorrelated?  How 
does  one  extract  orthogonal  information 
from  TIs? 

The  intercorrelation  of  TIs  was  studied 
earlier  with  a  limited  set  of  invariants.  Thus, 
Motoc  and  Balaban  described  graphically 
the intercorrelations of the few TIs known
until 1981. These aspects were reviewed in the
early 1980s. Basak et al. studied the
mutual relatedness of a set of ninety TIs
calculated for a set of 3,692 diverse
chemicals. A third study by Todeschini et
al.  will  be  discussed  in  the  last  section  of 
this  paper. 

All  such  studies  were  limited  in  the 
sense  that  they  analyzed  data  on  a  smaller 
and  less  diverse  group  of  TIs.  Therefore,  in 
this  paper,  we  have  studied  the  mutual 
relatedness  of  a  set  of  202  TIs.  We  have  also 
tried  to  extract  useful  and  orthogonal 
structural  information  from  the  calculated 
TIs. This study also reports, for the first
time, a comprehensive discussion of Basak's
information content indices ($IC_r$, $SIC_r$,
$CIC_r$), the triplet indices (proposed by one of
the present authors), and Balaban's average
distance-based connectivity index J as
compared to the traditional and more
widely used indices.

The  goal  of  this  paper  is  two-fold:  (a)  to 
study  the  degree  of  intercorrelation  among 
the  various  types  of  topological  indices,  and 
(b)  to  extract  mutually  uncorrelated 
(orthogonal)  topological  parameters  which 
can  be  used  for  QSAR/QSPR  studies, 
quantitation  of  intermolecular  similarity/ 
dissimilarity  as  well  as  characterization  of 
real  and  virtual  combinatorial  libraries.  To 
this  end,  we  studied  the  mutual  relatedness 
of  a  set  of  over  two  hundred  topological 
indices  in  this  paper. 


3.  METHODS 

3.1 Chemical Databases. Two sets of
chemicals were analyzed in this study:
a set of 139 hydrocarbons, representing a
moderately homogeneous set of chemicals,
and a set of 1037 diverse chemicals. The
hydrocarbons consisted of 73 C3-C9
alkanes, 29 alkylbenzenes, and 37
polycyclic aromatic hydrocarbons. The
diverse set of 1037 compounds consists of
those chemicals from the US EPA ASTER
system for which a measured boiling point
was available and for which hydrogen bonding
potential did not exist (as indicated by HB1 = 0).
The composition of these data sets is
indicated in Table 1. Table 2 presents the list
of all 202 parameters calculated in this
study.

Tables 1 and 2 around here

3.2 Calculation of TIs. The TIs
calculated for this study (some of which are
included in Table 1) include the Wiener number
W, molecular connectivity indices as
calculated by Randić and by Kier and Hall,
frequency of path lengths of varying size,
information-theoretic indices defined on
distance matrices of graphs using the
methods of Bonchev and Trinajstić, Roy et
al., Basak et al., as well as those of
Raychaudhury et al., parameters defined
on the neighborhood complexity of vertices
in hydrogen-filled molecular graphs, and
Balaban's J indices as well as triplet
indices. The majority of the TIs were
calculated using the program POLLY 2.3.
The J indices and triplet indices were
calculated using software developed in-house
by the authors.

4.  STATISTICAL  ANALYSIS 

For  both  sets  of  chemicals,  the  computed 
TIs  were  transformed  by  the  natural 
logarithm  of  the  index  plus  a  constant, 
generally  one.  This  was  done  since  the  scale 
of some indices may be several orders of
magnitude  greater  than  that  of  other  indices. 
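
The scaling step can be shown in a few lines. The sketch below applies ln(x + c) with c = 1 by default; the index values used are invented for illustration and are not data from the study.

```python
from math import log


def log_transform(values, constant=1.0):
    """Natural logarithm of each index value plus a constant (generally one)."""
    return [log(value + constant) for value in values]


# Hypothetical values of a large-range index and a small-range index
large_range_index = [4.0, 120.0, 3500.0]
small_range_index = [0.0, 0.8, 1.6]
print(log_transform(large_range_index))
print(log_transform(small_range_index))
```

Adding the constant before taking the logarithm keeps indices that are zero for some chemicals well defined, since ln(0 + 1) = 0.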

For  each  set,  a  technique  known  as 
variable  clustering  was  performed  using 
SAS procedure VARCLUS. The variable
clustering  procedure  divides  the  set  of 
indices  into  disjoint  clusters,  such  that  each 
cluster  is  essentially  unidimensional.  This  is 
accomplished  by  a  repeated  principal 
components  analysis  of  the  sets  of  indices. 
The  initial  principal  component  analysis 
examines  all  indices  and  defines  two 
principal  components  or  eigenvectors.  If  the 
eigenvalue  for  the  second  component  is  > 
1.0,  the  indices  are  split  into  separate 
clusters  by  correlating  the  indices  with  the 
first  and  second  principal  component.  Those 
indices  most  correlated  with  the  first 
component  form  one  cluster  and  those 
indices  most  correlated  with  the  second 
component  form  another  cluster,  thus 
forming  two  disjoint  clusters.  A  principal 
component  analysis  is  then  performed  for 
each  cluster  of  indices,  with  the  cluster 
being  split  if  the  eigenvalue  for  the  second 
component  is  >  1.0.  The  procedure  is 
repeated  until  the  second  eigenvalue  is  <  1.0 
for  all  clusters. 
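
The splitting rule described above can be sketched in plain Python. This is a simplified illustration, not a reimplementation of SAS VARCLUS: the two leading principal components of the correlation matrix are found by power iteration with deflation, a variable's correlation with a component is taken as |eigenvector entry| × √eigenvalue, and any rotation or reassignment refinements of the actual procedure are omitted. All names and the toy data are assumptions made for this example, and constant variables are assumed to have been removed beforehand.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlations between variables given as equal-length columns."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        dev = [x - mean for x in col]
        norm = sqrt(sum(d * d for d in dev))  # assumes no constant column
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]


def top_eigenpair(mat, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(mat)
    vec = [1.0 + 0.01 * i for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        new = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in new))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in new]
    image = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * w for v, w in zip(vec, image)), vec


def split_cluster(columns, members, threshold=1.0):
    """Recursively split a cluster while its second eigenvalue exceeds threshold."""
    if len(members) < 2:
        return [members]
    corr = correlation_matrix([columns[m] for m in members])
    n = len(members)
    lam1, v1 = top_eigenpair(corr)
    deflated = [[corr[i][j] - lam1 * v1[i] * v1[j] for j in range(n)]
                for i in range(n)]
    lam2, v2 = top_eigenpair(deflated)
    if lam2 <= threshold:
        return [members]
    first, second = [], []
    for k, m in enumerate(members):
        load1 = abs(v1[k]) * sqrt(lam1)  # |correlation| with component 1
        load2 = abs(v2[k]) * sqrt(lam2)  # |correlation| with component 2
        (first if load1 >= load2 else second).append(m)
    if not first or not second:
        return [members]
    return (split_cluster(columns, first, threshold)
            + split_cluster(columns, second, threshold))


# Toy data: variables 0-2 follow factor f, variables 3-4 follow factor g
f = [-3, -1, 1, 3, -3, -1, 1, 3]
g = [-1, -1, -1, -1, 1, 1, 1, 1]
e1 = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
e2 = [0.3, 0.3, -0.3, -0.3, -0.3, -0.3, 0.3, 0.3]
h = [0.3, -0.3, -0.3, 0.3, -0.3, 0.3, 0.3, -0.3]
columns = [f, [a + b for a, b in zip(f, e1)], [a + b for a, b in zip(f, e2)],
           g, [a + b for a, b in zip(g, h)]]
clusters = split_cluster(columns, list(range(5)))
print(clusters)
```

The toy data are built so that the two groups of variables are uncorrelated with each other, which makes the first split recover the two groups and leaves each with a second eigenvalue well below 1.0.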

5.  RESULTS  AND  DISCUSSION 
The first database (denoted by A)
consists of 139 hydrocarbons (alkanes,
alkylbenzenes and polycyclic aromatics) and
162 TIs. The number of indices examined
was reduced from the original 202 by
removing all but one of the degenerate (i.e.,
correlation of 1.0) indices and those indices
that were constant (0.0) for all chemicals.
The second database (denoted by B) is a
diverse one and contains 1037 chemical
structures and 176 non-degenerate, non-constant
indices.
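
The reduction from 202 indices to the non-degenerate, non-constant subset can be sketched as a simple filter. The sketch assumes the indices are held in a dictionary of value lists, treats an index as degenerate when its Pearson correlation with an already retained index is 1.0 within a small tolerance, and keeps the first index of each degenerate group; the tolerance and the order of retention are assumptions made here, not details given in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))


def drop_constant_and_degenerate(table, tol=1e-9):
    """Return the names of indices kept after removing constant and degenerate ones."""
    kept = []
    for name, values in table.items():
        if max(values) == min(values):
            continue  # constant for all chemicals
        if any(pearson(values, table[other]) > 1.0 - tol for other in kept):
            continue  # correlation of 1.0 with an index already retained
        kept.append(name)
    return kept


# Propane to n-hexane: atom count P0, bond count P1, a ring index, Wiener number
table = {
    "P0": [3, 4, 5, 6],
    "P1": [2, 3, 4, 5],        # equals P0 - 1 for acyclic hydrocarbons
    "ring_index": [0, 0, 0, 0],
    "W": [4, 10, 20, 35],
}
kept = drop_constant_and_degenerate(table)
print(kept)
```

In this example P1 is dropped because it is perfectly correlated with P0, and the ring index is dropped because it is zero for every acyclic compound.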

The  results  of  the  variable  cluster 
analysis  will  be  presented,  discussing  first 
how  the  descriptors  (variables)  for  database 
A  become  clustered,  and  then  surveying  the 


descriptor  clustering  for  database  B,  as  well 
as  the  correspondence  between  these 
clusters.  Inter-cluster  correlation  will  then  be 
described. 

The  clusters  have  been  ordered 
according  to  decreasing  numbers  of 
descriptors  in  each  cluster;  when  clusters 
contain  the  same  number  of  descriptors,  the 
numbering  of  the  corresponding  clusters  is 
arbitrary. 

In  Fig.  1,  one  can  see,  in  graphical  form, 
on  the  left-hand  side  the  points  denoting  the 
clusters  that  group  together  the  descriptors 
for  the  hydrocarbon  database  A,  and  on  the 
right-hand  side  those  corresponding  to  the 
diverse  database  B.  Each  cluster  is  denoted 
by  a  letter  (A  or  B)  and  a  number.  The  total 
number  of  variables  in  each  cluster  is 
written  under  each  point.  Full  lines  connect 
A-type  with  B-type  clusters,  having 
inscribed  on  them  the  numbers  of 
descriptors  common  to  each  pair  of  clusters; 
when  no  number  is  inscribed,  this  indicates  a 
single  common  descriptor.  Dashed  side-lines 
denote  the  descriptors  that  do  not  have 
counterparts  in  the  other  set  of  clusters,  and 
the  associated  numbers  on  these  side-lines 
indicate  the  numbers  of  such  “orphan” 
descriptors.  Because  the  two  data  sets  differ 
both  in  the  numbers  of  compounds  and  in 
their  structures,  it  is  normal  to  expect  that 
clusters  for  one  data  set  will  have 
counterparts  in  several  clusters  in  the  other 
data set. This is indeed what was found,
as will be shown below when the
diverse data set is analyzed.

Fig.  1  around  here 

Only in a single case have we found a
one-to-one correspondence between clusters
of descriptors corresponding to the two data
sets (A12 and B14). Nevertheless, in several
instances (A6, A11; B4, B9, B15, B16, and
B17), a cluster for one data set (say, A) was
found to have all its descriptors in common
with only one cluster of the other data set
(say, B); however, this latter cluster also
contains descriptors found in more than one
cluster of the other set.
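
The bookkeeping behind Fig. 1 (numbers of descriptors shared by each pair of clusters, and "orphan" descriptors without a counterpart) can be sketched with set operations. The cluster contents below are a made-up miniature example with hypothetical descriptor labels, not the memberships reported in Tables 3 and 4.

```python
def cluster_overlap(clusters_a, clusters_b):
    """Number of descriptors shared by each (A cluster, B cluster) pair."""
    shared = {}
    for name_a, members_a in clusters_a.items():
        for name_b, members_b in clusters_b.items():
            common = set(members_a) & set(members_b)
            if common:
                shared[(name_a, name_b)] = len(common)
    return shared


def orphan_counts(clusters, other_clusters):
    """Per cluster, the number of descriptors absent from the other clustering."""
    present_elsewhere = set().union(*other_clusters.values())
    counts = {}
    for name, members in clusters.items():
        missing = set(members) - present_elsewhere
        if missing:
            counts[name] = len(missing)
    return counts


# Hypothetical miniature clusterings (labels are placeholders)
clusters_a = {"A1": {"W", "P0", "chi1"}, "A2": {"J", "chi6"}}
clusters_b = {"B1": {"W", "P0"}, "B2": {"chi1", "chi3_chain"}, "B3": {"J"}}

links = cluster_overlap(clusters_a, clusters_b)
orphans_a = orphan_counts(clusters_a, clusters_b)
orphans_b = orphan_counts(clusters_b, clusters_a)
print(links, orphans_a, orphans_b)
```

In this toy case cluster A1 is linked to both B1 and B2, which is the one-to-many pattern expected when the two data sets differ in size and structural diversity.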

5.1  Clustering  of  descriptors  for 
hydrocarbons.  The  descriptors  for  database 
A  are  grouped  in  14  clusters  summarized  in 
Table 3. Cluster A1 has 54 of the total of
162 descriptors; it therefore groups together
about one third of all variables. These
descriptors  depend  both  on  the  shape  and  the 
size  (magnitude)  of  the  molecular  graph; 
such  descriptors  include  the  Randic 
connectivity  index,  the  Kier-Hall  simple 
path  connectivity  indices,  the  Zagreb  group 
indices,  and  many  triplet  indices  having  as 
the  main  diagonal  column  vector  the  atomic 
numbers  Z  or  the  total  number  N  of  vertices. 

Table  3  around  here 

Cluster  A2  with  about  1/8  of  the  total 
number  of  descriptors  includes  molecular 
connectivity  indices  of  order  higher  than 
five,  the  J  indices,  as  well  as  two  closely 
similar  triplet  indices.  Cluster  A3  contains 
mainly  valence/bond-corrected  molecular 
connectivity  indices.  The  next  cluster,  A4, 
consists  mainly  of  the  information-based 
indices  IC  (information  content),  SIC 
(structural  information  content)  and  CIC 
(complementary  information  content)  for  the 
hydrogen-filled  graphs  of  order  higher  than 
2 for IC and higher than 3 for SIC and CIC.
Cluster  A5  is  composed  mainly  of  triplet 
indices  having  as  main  diagonal  unit  vectors 
either  distance  sums  or  total  number  N  of 
vertices. 

Each of the remaining clusters has fewer
than 10 descriptors. Clusters A6 and A7
contain  mostly  triplet  descriptors:  A6  with 
the  distance  sum  S,  and  A7  with  the  order  N 
of  the  hydrogen-depleted  graph,  as  the  main 
diagonal  unit  vector;  cluster  A7  also 
includes  two  simple  path-cluster  molecular 
connectivity  indices.  Cluster  A8  contains 
simple  cluster-  and  bond/valence-corrected 
cluster  connectivities  of  high  order  (4 
through  6).  Cluster  A9  again  consists 
exclusively  of  triplet  indices,  and  they  are 


based  on  summing  squares  of  LOVIs  based 
mainly  on  distance  sum  unit  vectors  on  the 
main  diagonal. 

Cluster A10 includes three information-theoretic
indices IC and SIC of low order (0
and  1)  as  well  as  two  triplet  indices  having 
in  common  the  two  unit  vectors  (distance 
sum  S  for  the  main  diagonal,  vertex  degree 
V  for  the  free  term)  and  the  operation  for 
assembling  LOVIs  into  an  index 
(summation  of  LOVI  square  roots). 

Interestingly, the four smallest clusters,
having four descriptors each, are pairwise
similar in type: A11 with A13, and A12 with
A14. Cluster A11 consists of information
TIs (IC, SIC, CIC) of low order (0 through
2), whereas A13 includes the same TIs of
slightly higher order (2 and 3). Clusters A12
and A14 group together molecular
connectivity indices based on simple cluster
and simple cycle, respectively.

A  general  remark  for  the  triplet  indices 
is  that  what  groups  them  together  is  not  the 
matrix  on  which  they  are  based  (adjacency 
matrix  or  distance  matrix)  but  the  two  unit 
vectors  that  convert  such  matrices  into 
systems  of  linear  equations. 

5.2  Clustering  of  descriptors  for  the 
diverse  set  of  compounds.  There  are  18 
variable  clusters  grouping  together  176 
variables  for  the  database  of  1037  diverse 
compounds (Table 4). Cluster B1, with 49
descriptors, includes 28% of all variables;
35 of these descriptors are common to
cluster A1. Some of these indices, e.g., W
(Wiener number), $P_0$ (number of non-hydrogen
atoms), $P_1$ (number of bonds in the
hydrogen-depleted graph), express
molecular size. It is interesting that most of
the triplet variables (AZVi, AZNi and ANNi
with i = 1 through 5, as well as several other
ones) are found to be common to clusters A1
and B1. Five other descriptors of the
connectivity ($\chi$) type also appear in both
clusters A1 and B1.

Table 4 around here

Cluster B2 has nine variables in common
with cluster A1; most of these ($^3\chi$, $^4\chi$, $P_2$
through $P_4$) are path connectivities of
intermediate order. A couple of triplet
indices (ANV1 and ANV5) are also in
common with cluster A1; another pair of
triplet indices (ASN3 and ASN4) are in
common with cluster A7.

Cluster  B3  contains  triplet  indices  with 
distance  sums  as  main-diagonal  vector;  they 
occur  in  clusters  A5  and  A9.  In  addition, 
two  descriptors  (MIDW  and  H*^)  appear  also 
in cluster A1.

Cluster B4 is uniquely associated with
cluster A2, and consists of the indices $^5\chi$, $^6\chi$,
$^5\chi^b$, $^6\chi^b$, $^5\chi^v$, $^6\chi^v$, and $P_6$ through $P_{10}$.
These descriptors are based on long paths;
therefore, these variables appear only when
large molecules are involved.

Seven  of  the  11  variables  of  cluster  B5 
form  exclusively  cluster  A6;  they  are  related 
to  molecular  shape  via  vertex  complexity 
and  graph  radius.  Five  triplet  indices  such  as 
ASN1, ASN5, DSN1, DSN5 and ANV2 also
are  common  to  these  two  clusters. 

Very interesting correspondences are
manifested by cluster B6, which is mainly
associated with two clusters involving the
hydrocarbon database, namely A4 and A13
(plus one descriptor in B6 which appears in
A10). All variables are of information-theoretic
type. The higher-order variables
($SIC_3$ through $SIC_6$ and $CIC_3$ through $CIC_6$)
are common to clusters B6 and A4 and
represent a true measure of molecular
complexity. The lower- and intermediate-order
indices such as $IC_1$ or $SIC_2$, which
appear in clusters B6 and A10 or B6 and
A13, respectively, provide information on
lower-order complexity that may be more
degenerate than that furnished by the higher-order
information indices. One should stress
here that information content indices form
clusters that are separate from clusters with
other descriptors, meaning that such
variables convey unique information relative
to structure and molecular complexity.

Cluster  B7  consists  only  of  path-cluster 
molecular  connectivity  descriptors  which 
were  included  in  clusters  A3,  A7  and  A8  for 
the  hydrocarbons. 

Cluster  B8  includes  triplet  indices,  all  of 
which  have  the  atomic  number  Z  for  the  free 
term  vector  in  the  system  of  linear 
equations.  Most  of  these  descriptors  appear 
in clusters A1, A5, and A9.

Cluster B9 consists of high-order
connectivity-cluster terms all contained in
cluster A8. For hydrocarbons, the bond- and
valence-corrected cluster descriptors ($^n\chi^b_C$
and $^n\chi^v_C$) are perfectly correlated with
the simple cluster descriptor $^n\chi_C$; therefore, the former
variables did not appear in the hydrocarbon
cluster A8. For the diverse-compound
database, such a correlation is not perfect
because of differences in atom types.

An interesting observation concerns
cluster B10: all six variables are absent from
the hydrocarbon database because this
database does not contain any 3- or 4-membered
rings, unlike the diverse
compound database. This is why the third- and
fourth-order chain connectivity indices appear only in
cluster B10.

Cluster B11 has all but one of its
descriptors contained in cluster A4; these
information content indices, $IC_2$ through
$IC_6$, measure a high degree of non-redundancy
of topological neighborhoods.

Cluster B12 has four of its variables
contained in cluster A11; these descriptors
($SIC_0$, $CIC_0$ through $CIC_2$) express lower-order
redundancy of topological
neighborhoods. This is true of indices $IC_0$
and $SIC_1$ as well, which are present in
cluster A10.

From cluster B13, the six descriptors
(simple, bond- and valence-corrected chain
molecular connectivity indices) are
partitioned equally between clusters A2 and
A14, according to the 6- versus 5-membered
ring size, respectively; in the hydrocarbon
data base A, six-membered chains (or rings)
predominate.

Cluster B14 is exclusively associated in
a one-to-one relationship with cluster A12.
The corresponding simple cluster descriptors, as
well as their bond- and valence-corrected
counterparts, represent connectivity indices
on three- and four-vertex structural clusters.
For the hydrocarbon database, we have
again a case in which the two indices $^4\chi^b_C$
and $^4\chi^v_C$, being perfectly correlated with $^4\chi_C$,
do not appear explicitly in cluster A12.

Half  of  the  variables  (J-type  indices)  in 
cluster  B15  are  contained  in  cluster  A2. 
These  J  indices  again  form  a  cluster  apart 
from  all  other  ones  in  the  case  of  the  diverse 
data  base,  proving  that  when  heteroatoms 
are  taken  into  account,  the  information 
provided  by  such  J-type  indices  is  unique. 

Clusters  B16,  B17  and  B18  have  each  a 
small  number  of  triplet-type  descriptors;  the 
three  descriptors  of  cluster  B17  are  all 
contained  in  cluster  A7. 

5.3  Inter-cluster  correlations.  From 
each  cluster  we  select  15-25%  of  the 
descriptors  according  to  the  maximal  value 
of  the  correlation  coefficient  with  their  own 
cluster. In most cases, the first selected
descriptor also has the minimal value of the
correlation with the next closest cluster,
expressed by the $1 - r^2$ value. When choosing
more  than  one  index  from  the  same  cluster, 
after  the  first  one  was  selected  as  indicated 
above,  the  next  one  must  also  fulfill  a  third 
criterion,  namely  a  low  intercorrelation  with 
the  previously  selected  indices  of  the  same 
cluster. 

There were four inter-cluster correlations
within the hydrocarbon data set that were
greater than 0.9 in absolute value, and all involved cluster A1.
Cluster A1 was positively correlated with
A2, A3, and A7. Cluster A1 was correlated
negatively with A5. Each of the clusters
characterizes  some  aspect  of  molecular  size 
and  shape. 


Cluster B1 showed an inter-cluster
correlation of 0.92 with cluster B2 and -0.90
with cluster B3. These were the only inter-cluster
correlations reaching 0.9 in absolute value. These
clusters are the three largest clusters in set
B. Like cluster A1, cluster B1 groups TIs
expressing molecular size and shape.
Interestingly, in set A cluster A1 also had a
negative inter-cluster correlation with cluster
A5; it is therefore not surprising that clusters
A5 and B3 have the most abundantly
populated line connecting them in Fig. 1.

In summary, for the hydrocarbon data
base there are four inter-cluster correlations
with $|r| \geq 0.90$, all involving on one hand the
first cluster A1, and on the other hand
clusters A2, A3, A5, and A7. For the diverse
compound data base there are only two such
inter-cluster correlations with $|r| \geq 0.90$,
namely B1 with B2 and B3. This is not
unexpected, as the combination of the first
three clusters in each case contains more
descriptors than all the remaining
clusters together.

In  this  context,  one  should  mention  that 
Todeschini  and  coworkers  published  an 
interesting  study  on  23  TIs  for  a  set  of  667 
diverse  chemicals,  20%  of  which  were 
hydrocarbons;  the  above  authors  excluded 
10  of  these  TIs  because  they  were 
degenerate,  redundant,  or  had 
intercorrelation  factors  higher  than  0.90.  A 
graph  depicting  highly  intercorrelated 
indices  using  data  published  by  these 
authors  is  presented  in  Fig.  2,  which  is 
similar  to  a  graph  published  earlier. 

Fig.  2  around  here 

Ten TIs were then selected by
Todeschini et al., namely the molecular
weight (MW), J, IC, CIC, the bonding
information content (BIC), the mean Randić
connectivity ($\chi$), the information content on
atomic composition ($I_{AC}$), the mean Wiener
index (W), and the mean information indices
on equality of distance degree and on the
magnitude of distance degree ($I^E_{D,deg}$ and
$I^W_{D,deg}$, respectively). Then, using principal
component analysis for the above 10 TIs,
Todeschini et al. analyzed the composition
of the first six principal components. They
found that the first PC is mainly composed
of indices that express the size of molecules
(MW, W, IC, $I^E_{D,deg}$ and $I^W_{D,deg}$). This is in
agreement with the earlier finding of Basak
et al. for a set of 3,692 diverse
chemicals that the first PC is related to
molecular size. Further, Todeschini et al.
found that the second PC is dominated by
indices expressing information on bonds
(IC, CIC, and BIC). This is also analogous
to the results reported by Basak et al. that
the second axis represents molecular
complexity as encoded by higher-order
neighborhood complexity indices ($IC_2$, $IC_3$,
$SIC_2$, $SIC_3$, $CIC_2$, $CIC_3$, etc.). The IC, CIC
and BIC indices used by Todeschini et al.
are based solely on first-order topological
bonding/neighborhoods, and on slightly
different equivalence relations as compared
to the $IC_r$, $SIC_r$, and $CIC_r$ indices defined by
Roy et al. In studies by Basak et al., the
first-order complexity indices ($IC_1$, $SIC_1$,
$CIC_1$) were usually most correlated with the
first PC. Each of the next four PCs in
Todeschini et al.'s study is dominated by
a single TI, viz., $I_{AC}$, J (indicating
branching), and $I^E_{D,deg}$ (connected with the
position of substituents on the molecular
scaffold), respectively.

Table 1. Summary of Chemical Classes or Features in Databases Analyzed.

Chemical Classes or Features              Database A        Database B
                                          (Hydrocarbons)    (Diverse)

Total Number of Compounds                 139               1037
Hydrocarbons                              139               565
  ♦ Alkanes, Cyclic Alkanes               73                206
  ♦ Aromatics                             66                288
    - Alkyl Benzenes                      29                80
    - Fused Rings                         37                56
    - Polycyclic Aromatics                2>1               49
Non-hydrocarbons                          0                 472
  ♦ Halogen containing compounds                            359
  ♦ Heteroatom containing compounds                         101
    (Sulphur or Phosphorous)
  ♦ Compounds containing both                               12
    halogens & heteroatoms
    - Organosulfides                                        105
    - Organophosphorous                                     8

Table 2. Symbols and definitions of topological parameters

Index      Definition

I_D^W      Information index for the magnitudes of distances between all possible
           pairs of vertices of a graph
Ī_D^W      Mean information index for the magnitude of distance
W          Wiener index = half-sum of the off-diagonal elements of the distance
           matrix of a graph
I^D        Degree complexity
H^V        Graph vertex complexity
H^D        Graph distance complexity
IC         Information content of the distance matrix partitioned by frequency of
           occurrences of distance h
O          Order of neighborhood when ICr reaches its maximum value for the
           hydrogen-filled graph
I_ORB      Information content or complexity of the hydrogen-suppressed graph at
           its maximum neighborhood of vertices
M1         A Zagreb group parameter = sum of square of degree over all vertices
M2         A Zagreb group parameter = sum of cross-product of degrees over all
           neighboring (connected) vertices
ICr        Mean information content or complexity of a graph based on the r-th
           (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SICr       Structural information content for r-th (r = 0-6) order neighborhood
           of vertices in a hydrogen-filled graph
CICr       Complementary information content for r-th (r = 0-6) order
           neighborhood of vertices in a hydrogen-filled graph
^hχ        Path connectivity index of order h = 0-6
^hχ_C      Cluster connectivity index of order h = 3-6
^hχ_PC     Path-cluster connectivity index of order h = 4-6
^hχ_Ch     Chain connectivity index of order h = 3-6
^hχ^b      Bond path connectivity index of order h = 0-6
^hχ^b_C    Bond cluster connectivity index of order h = 3-6
^hχ^b_Ch   Bond chain connectivity index of order h = 3-6
^hχ^b_PC   Bond path-cluster connectivity index of order h = 4-6
^hχ^v      Valence path connectivity index of order h = 0-6
^hχ^v_C    Valence cluster connectivity index of order h = 3-6
^hχ^v_Ch   Valence chain connectivity index of order h = 3-6
^hχ^v_PC   Valence path-cluster connectivity index of order h = 4-6
P_h        Number of paths of length h = 0-10
J          Balaban's J index based on distance
J^B        Balaban's J index based on bond types
J^X        Balaban's J index based on relative electronegativities
J^Y        Balaban's J index based on relative covalent radii
Triplet    Global invariants based on solutions of linear equation systems using
           the adjacency matrix (A), distance matrix (D), and column/row vectors:
           distance sums (S), atomic number (Z), number of non-hydrogen atoms
           (N and N²), vertex degree (V), or numerical constants (1). Notation is
           described by triplets (e.g. AZV). Results are weightings for each atom
           in a molecule. These weights are combined by 5 possible formulas:
           1 = Sum of weights: Σ_i x_i; 2 = Sum of squared weights: Σ_i x_i^2;
           3 = Sum of square root of weights: Σ_i x_i^(1/2);
           4 = Sum of cross-product: Σ_ij (x_i · x_j); and
           5 = product of weights: N · [Π_i x_i]^(1/N)

Table 3. Summary of Variable Clustering for 139 Hydrocarbons

Cluster   Number of   Representative Variables (max. 25% of total listed)
          Variables

A1        54          DN²Z4, DN²N4, P0, AZV4, ASZ4, ANN3, ANN5, AZN3
A2        19          P7, 6χb, 6χv
A3        13          0χb, 0χv, ANZ1
A4        13          SIC6, SIC5, IC6
A5        12          DSZ1, DSZ5, ASZ1
A6        9           DSZ3, DSN5
A7        9           DSN3, DN²N,
A8        6           ^yCc,
A9        6           DSZj, ASZ2
A10       5           SIC,
A11       4           CIC,
A12       4           u >
A13       4           SIC3
A14       4           ^XCh

Table 4. Summary of Variable Clustering for 1037 Diverse Chemicals.

Cluster   Number of   Representative Variables (max. 25% of total listed)
          Variables

B1        49          P0, ANN3, ANN5, AN13, ANN1, ANV4, AS11, DN²11
B2        13          ANV1, P3, M2
B3        13          AS11, AS15, DS11
B4        13          6χ, 6χb, P7
B5        11          ASN5, AS13, ASN1
B6        10          SIC3, SIC4, CIC4
B7        9           5χbpc, 5χpc
B8        8           ASZ2, ASZ1
B9        6           5χbc, 5χc
B10       6           3χch, 3χbch
B11       6           IC4, IC5
B12       6           CIC1, SIC1
B13       6           6χvch, 6χbch
B14       6           3χbc, 4χc
B15       4           f
B16       4           AS12
B17       4           DN²N1
B18       2           ANS1

Legends  of  figures 


Fig. 1. Associations between clusters of descriptors for the hydrocarbon database (A-type
clusters) and the database with diverse compounds (B-type clusters). Solid lines connect A-type
descriptors with B-type descriptors, and the numbers of common descriptors are indicated on
such lines (when no number is indicated, there is just one common descriptor). Dashed lateral
lines indicate descriptors that have no correspondence for the other type.

Fig. 2. Graph of highly correlated topological indices (TIs) according to Todeschini et al.
(notation of TIs as in Tab. 3 of that reference). Lines connect TIs with r > 0.90.

Appendix 1.7  Use of graph invariants in QMSA and predictive toxicology

I.  INTRODUCTION

A contemporary interest in mathematical chemistry is the characterization of molecular
structure using graph theoretic formalism [1-11]. A graph G = [V, E] consists of an
ordered pair of two sets V and E, representing the vertices and edges, respectively. G
becomes a molecular graph when the set V represents the set of atoms in a molecule
and the set E symbolizes chemical bonds between adjacent atoms [8].

Mathematical characterization of molecular graphs (structures) may be
accomplished using graph invariants. An invariant may be a polynomial, a sequence of
numbers, or a real number. A real number characterizing a molecular graph is called a
topological index (TI). TIs quantify different aspects of molecular architecture, viz., size,
shape, cyclicity, branching, symmetry, etc. [8].

TIs have been used extensively in quantitative structure-property/activity
relationships (QSPR and QSAR, respectively) and the quantification of intermolecular
similarity/dissimilarity of chemicals [10-24]. In quantitative molecular similarity analysis
(QMSA) studies, TIs have been used to derive high dimensional structure spaces where
the Euclidean distance D_ij between a pair of molecules i and j is used to quantify the
similarity between them. Similarity measures can be used either for the selection of
analogs of chemicals or in the prediction of the property/activity of a molecule from the
property of its selected neighbor(s).

In some of our recent QSAR/QMSA studies we have used different similarity
measures derived from TIs in the selection of analogs and prediction of
properties/activities for diverse sets of chemicals. We have also used orthogonal
descriptors derived from a set of over 100 graph invariants to estimate
bioactivity/toxicity of different groups of molecules. In this paper we have used similarity
measures derived from TIs in: a) selecting analogs of an isospectral graph from a
diverse set of 221 compounds, and b) predicting the mutagenicity of a set of 113
mutagens and non-mutagens using QMSA methods.


II.  METHODS 


Databases

A set of 19 pairs of isospectral graphs from the work of Balasubramanian and Basak
[25] was added to a set of 107 benzamidines [26] and a composite set of 76 diverse
compounds used in an earlier study by Basak and Grunwald [23] to create a varied
library of 221 compounds. This composite library was created to provide a large set
containing both congeneric and non-congeneric sets to test analog selection methods.
The chemical structures for the 19 pairs of isospectral graphs have been presented
previously [25].

A second data set, representing a subset of the set of 277 chemicals presented by
Yamaguchi et al. [27], was also used in the current study. This subset consisted of all
the chemicals in the set of 277 chemicals that had reported results for mutagenicity in
the Ames test, mutagenicity in the medium-term liver carcinogenesis bioassay, and
carcinogenicity in the two-year rodent bioassay in rat and/or mouse. This subsetting
resulted in a set of 113 chemicals, 68 of which are classified as non-mutagens and 45
of which are classified as mutagens in the Ames test. This set of chemicals and their
observed mutagenicity are reported in Table 1.

Calculation  of  Topological  Indices 

The  TIs  calculated  for  this  study  are  listed  in  Table  2  and  include  Wiener  number 
[28],  molecular  connectivity  indices  as  calculated  by  Randic  [29]  and  Kier  and  Hall  [4], 
frequency  of  path  lengths  of  varying  size,  information  theoretic  indices  defined  on 
distance  matrices  of  graphs  using  the  methods  of  Bonchev  and  Trinajstic  [30]  as  well  as 
those  of  Raychaudhury  et  al.  [31],  parameters  defined  on  the  neighborhood  complexity 
of  vertices  in  hydrogen-filled  molecular  graphs  [32-34],  and  Balaban's  J  indices  [35-37]. 
The  majority  of  the  TIs  were  calculated  using  POLLY  2.3  [38].  The  J  indices  were 
calculated  using  software  developed  by  the  authors. 

The Wiener index (W) [28], the first topological index reported in the chemical
literature, may be calculated from the distance matrix D(G) of a hydrogen-suppressed
chemical graph G as the sum of the entries in the upper triangular distance submatrix.
The distance matrix D(G) of a nondirected graph G with n vertices is a symmetric n x n
matrix (d_ij), where d_ij is equal to the distance between vertices v_i and v_j in G. Each
diagonal element d_ii of D(G) is zero. We give below the distance matrix D(G1) of the
unlabeled hydrogen-suppressed graph G1 of thioacetamide (Fig. 1):

W is calculated as:

    W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h        (1)

where g_h is the number of unordered pairs of vertices whose distance is h. Thus for
D(G1), W has a value of nine.
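
The calculation of W can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vertex numbering of the hydrogen-suppressed thioacetamide skeleton (central carbon as vertex 0, bonded to the methyl carbon, sulfur and nitrogen) and the function names `distance_matrix` and `wiener_index` are hypothetical and are not taken from the software used in this study.

```python
from collections import deque

def distance_matrix(adjacency):
    """Topological distance matrix of a connected, undirected graph.

    `adjacency` maps each vertex index 0..n-1 to a list of its neighbors.
    Distances are found by breadth-first search from every vertex.
    """
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            vertex, d = queue.popleft()
            dist[source][vertex] = d
            for nb in adjacency[vertex]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, d + 1))
    return dist

def wiener_index(dist):
    """Sum of the entries in the upper triangle of the distance matrix."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))

# Assumed numbering: 0 = central C, 1 = CH3, 2 = S, 3 = NH2
thioacetamide = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
D = distance_matrix(thioacetamide)
W = wiener_index(D)
print(W)  # 9
```

With this numbering there are three vertex pairs at distance 1 and three at distance 2, so W = 3(1) + 3(2) = 9, in agreement with the value quoted in the text.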

[Insert  Fig.  1  here]


Randic's connectivity index [29], and higher-order connectivity path, cluster,
path-cluster and chain types of simple, bond and valence connectivity parameters were
calculated using the method of Kier and Hall [4]. The generalized form of the simple
path connectivity index is as follows:

    {}^{h}\chi = \sum_{\text{paths}} (v_i v_j \cdots v_{h+1})^{-1/2}        (2)

where v_i, v_j, ..., v_{h+1} are the degrees of the vertices in the path of length h. The path
length parameters (P_h), the number of paths of length h (h = 0, 1, ..., 10) in the
hydrogen-suppressed graph, are calculated using standard algorithms.
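
As a hedged illustration of the simple path connectivity index and the path counts P_h, the sketch below enumerates all simple paths of a given length by depth-first search. It is not the POLLY implementation; the function names, the adjacency-list input and the four-vertex star graph used as the example are assumptions made here for clarity.

```python
def paths_of_length(adjacency, h):
    """All simple paths with h edges, each undirected path listed once."""
    found = []

    def extend(path):
        if len(path) == h + 1:
            # A path and its reverse are the same path; keep one orientation.
            if h == 0 or path[0] < path[-1]:
                found.append(tuple(path))
            return
        for nb in adjacency[path[-1]]:
            if nb not in path:
                extend(path + [nb])

    for start in adjacency:
        extend([start])
    return found

def path_connectivity(adjacency, h):
    """Simple path connectivity index of order h (vertex degrees as weights)."""
    total = 0.0
    for path in paths_of_length(adjacency, h):
        product = 1
        for vertex in path:
            product *= len(adjacency[vertex])  # degree of the vertex
        total += product ** -0.5
    return total

# Four-vertex star graph (assumed example): vertex 0 is bonded to 1, 2 and 3
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
path_counts = [len(paths_of_length(star, h)) for h in range(4)]
chi_1 = path_connectivity(star, 1)
print(path_counts, round(chi_1, 4))
```

For this star graph each of the three edges contributes (3 x 1)^(-1/2), so the first-order index equals the square root of 3, and the path counts are P_0 = 4, P_1 = 3, P_2 = 3 and P_3 = 0.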

Information-theoretic topological indices are calculated by the application of
information theory on chemical graphs. An appropriate set A of n elements is derived
from a molecular graph G depending upon certain structural characteristics. On the
basis of an equivalence relation defined on A, the set A is partitioned into disjoint
subsets A_i of order n_i (i = 1, 2, ..., h; \sum_i n_i = n). A probability distribution is then
assigned to the set of equivalence classes:

    A_1, A_2, ..., A_h
    p_1, p_2, ..., p_h

where p_i = n_i/n is the probability that a randomly selected element of A will occur in the
i-th subset.

The mean information content of an element of A is defined by Shannon's relation
[39]:

    IC = -\sum_{i=1}^{h} p_i \log_2 p_i        (3)

The logarithm is taken at base 2 for measuring the information content in bits. The total
information content of the set A is then n x IC. Figure 2 provides a sample calculation
for IC1.

[Insert  Fig.  2  here] 

It is to be noted that the information content of a graph G is not uniquely defined. It
depends on how the set A is derived from G as well as on the equivalence relation
which partitions A into disjoint subsets A_i. For example, when A constitutes the vertex
set of a chemical graph G, two methods of partitioning have been widely used: a)
chromatic-number coloring of G where two vertices of the same color are considered
equivalent, and b) determination of the orbits of the automorphism group of G, whereafter
vertices belonging to the same orbit are considered equivalent.

Rashevsky was the first to calculate the information content of graphs where
"topologically equivalent" vertices were placed in the same equivalence class [40]. In
Rashevsky's approach, two vertices u and v of a graph are said to be topologically
equivalent if and only if for each neighboring vertex u_i (i = 1, 2, ..., k) of the vertex u,
there is a distinct neighboring vertex v_i of the same degree for the vertex v. While
Rashevsky used simple linear graphs with indistinguishable vertices to symbolize
molecular structure, weighted linear graphs or multigraphs are better models for
conjugated or aromatic molecules because they more properly reflect the actual
bonding patterns, i.e., electron distribution.

To account for the chemical nature of vertices as well as their bonding pattern,
Sarkar et al. [41] calculated information content of chemical graphs on the basis of an
equivalence relation where two atoms of the same element are considered equivalent if
they possess an identical first-order topological neighborhood. Since properties of
atoms or reaction centers are often modulated by stereo-electronic characteristics of
distant neighbors, i.e., neighbors of neighbors, it was deemed essential to extend this
approach to account for higher-order neighbors of vertices. This can be accomplished
by defining open spheres for all vertices of a chemical graph. If r is any non-negative
real number and v is a vertex of the graph G, then the open sphere S(v, r) is defined as
the set consisting of all vertices v_i in G such that d(v, v_i) < r. Therefore, S(v, 0) = ∅, S(v, r)
= {v} for 0 < r < 1, and S(v, r) is the set consisting of v and all vertices v_i of G situated at
unit distance from v, if 1 < r < 2.

One can construct such open spheres for higher integral values of r. For a particular
value of r, the collection of all such open spheres S(v, r), where v runs over the whole
vertex set V, forms a neighborhood system of the vertices of G. A suitably defined
equivalence relation can then partition V into disjoint subsets consisting of vertices
which are topologically equivalent for r-th order neighborhood. Such an approach has
been developed and the information-theoretic indices calculated based on this idea are
called indices of neighborhood symmetry [34].

In this method, chemicals are symbolized by weighted linear graphs. Two vertices u_0
and v_0 of a molecular graph are said to be equivalent with respect to r-th order
neighborhood if and only if corresponding to each path u_0, u_1, ..., u_r of length r, there is a
distinct path v_0, v_1, ..., v_r of the same length such that the paths have similar edge
weights and both u_0 and v_0 are connected to the same number and type of atoms up to
the r-th order bonded neighbors. The detailed equivalence relation has been described in
earlier studies [34,42].

Once partitioning of the vertex set for a particular order of neighborhood is
completed, ICr is calculated by Eq. 3. Basak et al. [32] defined another
information-theoretic measure, structural information content (SICr), which is calculated as:

    SIC_r = IC_r / \log_2 n        (4)

where ICr is calculated from Eq. 3 and n is the total number of vertices of the graph.

Another information-theoretic invariant, complementary information content (CICr)
[43], is defined as:

    CIC_r = \log_2 n - IC_r        (5)

CICr represents the difference between maximum possible complexity of a graph
(where each vertex belongs to a separate equivalence class) and the realized
topological information of a chemical species as defined by ICr. Sample calculations for
SIC1 and CIC1 have been included in Figure 2.
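
The three neighborhood indices follow directly from the sizes of the equivalence classes once the vertex partition is known. The Python sketch below assumes the partition has already been determined; the class sizes [4, 2, 2] are a hypothetical example chosen for easy arithmetic and are not the Figure 2 calculation, and the function names are illustrative only.

```python
import math

def information_content(class_sizes):
    """Shannon mean information content IC (bits) of a partition."""
    n = sum(class_sizes)
    return -sum((k / n) * math.log2(k / n) for k in class_sizes)

def structural_information_content(class_sizes):
    """SIC = IC / log2(n); requires n > 1 so that log2(n) is nonzero."""
    n = sum(class_sizes)
    return information_content(class_sizes) / math.log2(n)

def complementary_information_content(class_sizes):
    """CIC = log2(n) - IC."""
    n = sum(class_sizes)
    return math.log2(n) - information_content(class_sizes)

# Hypothetical partition of n = 8 vertices into classes of size 4, 2 and 2
sizes = [4, 2, 2]
ic = information_content(sizes)
sic = structural_information_content(sizes)
cic = complementary_information_content(sizes)
print(ic, sic, cic)  # 1.5 0.5 1.5
```

Here the class probabilities are 1/2, 1/4 and 1/4, giving IC = 1.5 bits; with log2(8) = 3 this yields SIC = 0.5 and CIC = 1.5. If every vertex formed its own class, IC would equal log2(n), SIC would be 1 and CIC would be 0.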

The information-theoretic index on graph distance, I_D^W, is calculated from the distance
matrix D(G) of a chemical graph G as follows [30]:

    I_D^W = W \log_2 W - \sum_{h} g_h \, h \log_2 h        (6)

The mean information index, \bar{I}_D^W, is found by dividing the information index I_D^W by W.
The information-theoretic parameters defined on the distance matrix, H^D and H^V, were
calculated by the method of Raychaudhury et al. [31].
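
A short numerical sketch of the distance-based information index may help. It assumes the list of pairwise topological distances is already available (here, those of a four-vertex star graph, used purely as an example); the function name is hypothetical.

```python
import math
from collections import Counter

def distance_information_index(pair_distances):
    """W*log2(W) - sum_h g_h*h*log2(h), from distances of unordered vertex pairs."""
    g = Counter(pair_distances)  # g[h] = number of pairs at distance h
    w = sum(h * count for h, count in g.items())  # Wiener index
    weighted = sum(count * h * math.log2(h) for h, count in g.items())
    return w * math.log2(w) - weighted

# Assumed example: four-vertex star, three pairs at distance 1 and three at distance 2
pair_distances = [1, 1, 1, 2, 2, 2]
wiener = sum(pair_distances)
i_dw = distance_information_index(pair_distances)
mean_i_dw = i_dw / wiener
print(round(i_dw, 4), round(mean_i_dw, 4))
```

For this example W = 9, so the index is 9 log2(9) - 6, approximately 22.53 bits, and the mean index is approximately 2.50 bits.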

Balaban defined a series of indices based upon distance sums within the distance
matrix for a chemical graph that he designated as J indices [35-37]. These indices are
highly discriminating with low degeneracy. Unlike W, the range of values of the J indices
is independent of molecular size. The general form of the J index calculation is as follows:

    J = \frac{q}{\mu + 1} \sum_{ij,\,\text{edges}} (s_i s_j)^{-1/2}        (7)

where the cyclomatic number μ (or number of rings in the graph) is μ = q - n + 1, with q
edges and n vertices, s_i is the sum of the distances of atom i to all other atoms, and
s_j is the sum of the distances of atom j to all other atoms [35]. Variants were proposed
by Balaban for incorporating information on bond type, relative electronegativities, and
relative covalent radii [36,37].
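
The distance-based J index can be sketched as follows. This is a minimal illustration for unweighted graphs only (none of the bond-type, electronegativity or covalent-radius variants), and it is not the authors' software; the function names and the two example graphs are assumptions.

```python
from collections import deque

def distance_sums(adjacency):
    """Sum of topological distances from each vertex to all other vertices."""
    sums = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            vertex = queue.popleft()
            for nb in adjacency[vertex]:
                if nb not in dist:
                    dist[nb] = dist[vertex] + 1
                    queue.append(nb)
        sums[source] = sum(dist.values())
    return sums

def balaban_j(adjacency):
    """J = q/(mu+1) * sum over edges of (s_i*s_j)^(-1/2), unweighted graph."""
    n = len(adjacency)
    edges = {frozenset((u, v)) for u in adjacency for v in adjacency[u]}
    q = len(edges)
    mu = q - n + 1  # cyclomatic number
    s = distance_sums(adjacency)
    total = 0.0
    for edge in edges:
        u, v = tuple(edge)
        total += (s[u] * s[v]) ** -0.5
    return q / (mu + 1) * total

# Assumed examples: a four-vertex star and a six-membered ring
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
j_star = balaban_j(star)
j_ring = balaban_j(ring)
print(round(j_star, 4), round(j_ring, 4))
```

In the six-membered ring every vertex has a distance sum of 9, q = 6 and μ = 1, so J = (6/2)(6/9) = 2. In the star, the distance sums are 3 for the center and 5 for each terminal vertex, giving J = 9/√15, approximately 2.32.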

Calculation of Atom Pairs

Atom pairs (APs) were calculated using the method of Carhart et al. [3]. An atom pair
is defined as a substructure consisting of two non-hydrogen atoms i and j and their
interatomic separation:

    <atom descriptor_i> - <separation> - <atom descriptor_j>

where <atom descriptor> contains information about the atomic type, number of
non-hydrogen neighbors and the number of π electrons. The interatomic separation of two
atoms is the number of atoms traversed in the shortest bond-by-bond path containing
both atoms. APs used in this study were calculated by the APProbe software [43].

III.  STATISTICAL  METHODS  AND  COMPUTATION 
OF  INTERMOLECULAR  SIMILARITY 


Data Reduction

Initially, all TIs were transformed by the natural logarithm of the index plus one. This
was done since the scale of some TIs may be several orders of magnitude greater than
that of other TIs.

A principal component analysis (PCA) was used on the transformed indices to
minimize the intercorrelation of indices. The PCA was conducted using the SAS
procedure PRINCOMP [44]. The PCA produces linear combinations of the TIs, called
principal components (PCs), which are derived from the correlation matrix. The first PC
has the largest variance, or eigenvalue, of the linear combination of TIs. Each
subsequent PC explains the maximal index variance orthogonal to the previous PCs,
eliminating any redundancies that could occur within the set of TIs. The maximum
number of PCs generated is equal to the number of TIs available. For the purposes of
this study, only PCs with eigenvalues greater than one were retained. A more detailed
explanation of this approach has been provided in a previous study by Basak et al. [13].
These PCs were subsequently used to determine similarity scores as described below.


Similarity  Measures 

Intermolecular similarity was measured using two distinct methods. The AP method
uses an associative measure described by Carhart et al. [3] and is based on atom pair
descriptors. The measurement is the ratio of the number of shared atom pairs between
two molecules over the total number of atom pairs present in the two molecules.
Similarity (S) between molecules i and j is defined as:

    S_{ij} = 2C / (T_i + T_j)        (8)

where C is the number of atom pairs common to molecules i and j. T_i and T_j are the total
number of atom pairs in molecules i and j, respectively. The numerator is multiplied by a
factor of 2 to reflect the presence of shared atom pairs in both compounds.

The second similarity method used Euclidean distance (ED) within an n-dimensional PC
space derived from TIs. ED between molecules i and j is defined as:

    ED_{ij} = \left[ \sum_{k=1}^{n} (D_{ik} - D_{jk})^2 \right]^{1/2}        (9)

where n equals the number of dimensions or PCs retained from the PCA. D_ik and D_jk are
the data values of the k-th dimension for molecules i and j, respectively.
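
Both similarity measures are simple to compute once the descriptors are available. The sketch below is illustrative only: the atom-pair strings are made-up placeholders rather than APProbe output, the PC scores are arbitrary numbers, and counting repeated atom pairs by their minimum multiplicity in the two molecules is an assumption of this sketch.

```python
import math
from collections import Counter

def atom_pair_similarity(pairs_i, pairs_j):
    """S_ij = 2C / (T_i + T_j), with repeated atom pairs counted by multiplicity."""
    count_i, count_j = Counter(pairs_i), Counter(pairs_j)
    common = sum((count_i & count_j).values())  # C: minimum of the two counts per pair
    total = sum(count_i.values()) + sum(count_j.values())  # T_i + T_j
    return 2 * common / total

def euclidean_distance(pcs_i, pcs_j):
    """ED_ij over the retained principal-component scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pcs_i, pcs_j)))

# Hypothetical atom-pair lists for two molecules (placeholder strings)
mol_a = ["C.X2-2-C.X2", "C.X2-2-C.X2", "C.X2-3-O.X1"]
mol_b = ["C.X2-2-C.X2", "C.X2-3-O.X1", "C.X2-3-N.X1"]
s_ab = atom_pair_similarity(mol_a, mol_b)

# Hypothetical PC scores in a two-dimensional PC space
ed_ab = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(round(s_ab, 4), ed_ab)
```

In this example two atom pairs are shared and each molecule has three, so S = 4/6, approximately 0.67; the Euclidean distance between the two points is 5. Identical atom-pair lists give S = 1 and disjoint lists give S = 0.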

Analog  /  K-Nearest  Neighbor  Selection 

Following  the  quantification  of  intermolecular  similarity  of  the  molecules,  analogs  or 
nearest  neighbors  are  determined  on  the  basis  of  both  S  and  ED.  In  the  case  of  the  AP 
method,  two  molecules  are  considered  identical  if  S=1,  while  they  have  no  atom  pairs  in 
common  if  S=0.  The  ED  method  measures  a  distance  between  molecules,  thus  the 
lower  the  value  of  ED  the  greater  the  similarity  between  two  molecules. 

Property  Estimation 

Since the data presented in the work of Yamaguchi et al. [27] represented
mutagenicity as non-mutagen (-) or mutagen (+), these data were treated as a zero-one
relationship, where non-mutagens have a value of zero and mutagens have a value of
one. In estimating the mutagenicity of the probe compound, the mean of the observed
mutagenicity of the K-nearest neighbors was used as the estimate. Thus, if the mean
resulted in a value greater than 0.5, the compound was classified as a mutagen, and if
it was less than 0.5, as a non-mutagen. However, if the mean was equal to 0.5, the
compound was not classified, as the results were inconclusive.
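
The classification rule can be sketched as follows. The sketch assumes a precomputed symmetric distance matrix in which smaller values mean greater similarity (for the AP measure one could, for instance, use 1 - S, which is an assumption made here); the function name, the toy distance matrix and the labels are hypothetical.

```python
def knn_mutagenicity(distances, labels, probe, k):
    """Estimate the 0/1 mutagenicity of `probe` from its k nearest neighbors.

    Returns 1 (mutagen), 0 (non-mutagen) or None when the neighbor mean
    is exactly 0.5 and the compound is left unclassified.
    """
    others = [j for j in range(len(labels)) if j != probe]
    nearest = sorted(others, key=lambda j: distances[probe][j])[:k]
    mean = sum(labels[j] for j in nearest) / len(nearest)
    if mean > 0.5:
        return 1
    if mean < 0.5:
        return 0
    return None

# Toy data: four compounds, the first two labeled as mutagens
distances = [
    [0.0, 1.0, 2.0, 4.0],
    [1.0, 0.0, 1.5, 3.0],
    [2.0, 1.5, 0.0, 1.0],
    [4.0, 3.0, 1.0, 0.0],
]
labels = [1, 1, 0, 0]
estimates = [knn_mutagenicity(distances, labels, 0, k) for k in (1, 2, 3)]
print(estimates)  # [1, None, 0]
```

With K = 2 the two neighbors of the first compound disagree, the mean is exactly 0.5 and no class is assigned; a tie of this kind can only occur when K is even.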


IV.  RESULTS 


Principal  Component  Analysis 

From the PCA of the 102 TIs, eight PCs with eigenvalues greater than one were
retained. These eight PCs explained, cumulatively, 95.2% of the total variance within
the TI data. Table 3 lists the eigenvalues of the eight PCs, the proportion of variance
explained by each PC, the cumulative variance explained, and the two TIs most
correlated with each individual PC.


Analog Selection

Figure 3 shows the results of the analog selection for isospectral graph 10.1.1 using
atom pairs to derive a similarity space and PCs to derive a Euclidean distance space.
The first five analogs (neighbors) for the probe compound, 10.1.1, are presented for
each of the similarity methods.


[Insert  Fig.  3  here] 


K-Nearest Neighbor Estimation

Table 4 presents the results for the prediction of mutagenicity for the 113 molecules
over a range of K values (K = 1-5) for both the AP and ED methods. The results are
presented as percent correctly classified, and overall percent correct prediction rates
are provided as a means of comparing the efficacy of the individual models. The
variability between the K levels is explained by the problematic nature of using a
binary relationship such as this one in estimation. When the number of neighbors was
even, the potential for unclassified compounds led to lower prediction rates than in the
case of an odd number of neighbors.


V.  DISCUSSION 

The  major  objective  of  this  paper  was  to  study  the  effectiveness  of  mathematical 
invariants  in  the  characterization  of  molecular  structure  and  the  estimation  of  the  toxicity 
of  chemicals.  An  invariant  maps  a  chemical  structure  into  the  set  R  of  real  numbers.  A 
specific  invariant  may  be  used  for  the  ordering  or  partial  ordering  of  sets  of  molecules  or 
in  structure-activity  relationship  studies  [45].  A  particular  structural  invariant  quantifies 
distinct  aspects  of  molecular  structure.  Therefore,  a  combination  of  such  indices  might 
be  more  powerful  in  the  mathematical  characterization  of  molecular  structure  as 
compared  to  the  use  of  one  specific  invariant.  The  problem  arises  out  of  the  fact  that 
often  the  various  graph  theoretic  indices  of  molecular  structures  are  strongly  correlated. 
We  have  attempted  to  resolve  this  problem  through  the  implementation  of  a  PCA  to 
derive  orthogonal  variables  from  a  large  set  of  calculated  TIs,  and  using  the  orthogonal 

parameters  in  the  characterization  of  structure  [10,12,15,17,18,22,23]. 

In the present study we have used calculated atom pairs and principal components
derived from TIs to select structural analogs for a probe compound from a diverse set
containing closely related structures. The result of this analog selection, depicted in
Figure 3, shows that each of the methods has sufficient power to reject dissimilar
structures when selecting five neighbors. In other words, we may conclude that
both the atom pair and Euclidean distance methods are capable of choosing similar
molecules from a structurally diverse collection. This is in line with our
earlier studies with various diverse sets of molecules [10,12,15,17,18,22,23].

The  central  paradigm  of  QSAR  holds  that  similar  structures  usually  have  similar 
properties.  To  test  this  idea,  we  selected  K-nearest  neighbors  (K=1-5)  for  each  molecule 
from  a  set  of  113  mutagens  and  non-mutagens  using  the  ED  and  AP  methods  and  used 
the  selected  nearest  neighbors  in  estimating  mutagenicity.  The  results  in  Table  4  show 
that  both  methods  lead  to  reasonably  good  estimates,  although  the  AP  method  was 
superior  to  the  ED  method. 

In  conclusion,  both  the  ED  and  AP  methods,  based  on  calculated  graph  theoretic 
structural  invariants,  did  reasonably  well  in  the  selection  of  structural  analogs  and  in  the 
estimation  of  chemical  properties  based  on  nearest  neighbors. 

Table 1. Mutagenicity in the Ames test for 113 chemicals

No.     Compound Name                                         Obs. Ames
                                                              Mutagenicity

1.5     butylated hydroxyanisole (BHA)                        0
1.6     caffeic acid                                          0
1.7     catechol                                              0
1.8     clofibrate                                            0
1.9     di(2-ethylhexyl)phthalate (DEHP)                      0
1.10    hydroquinone                                          0
1.11    p-methoxyphenol                                       0
1.12    sesamol                                               0
1.13    tamoxifen                                             0
1.14    acetaminophen                                         0
1.15    benzoin                                               0
1.16    EPN                                                   0
1.17    gallic acid                                           0
1.18    α-tocopherol                                          0
2.2     2-acetylaminofluorene (AAF)                           1
2.3     adriamycin                                            1
2.4     aflatoxin B1                                          1
2.5     benzo[a]pyrene                                        1
2.7     captafol                                              1
2.8     captan                                                1
2.9     carbazole                                             1
2.10    dibutylnitrosamine (DBN)                              1
2.11    diethylnitrosamine (DEN)                              1
2.12    3,2'-dimethyl-4-aminobiphenyl (DMAB)                  1
2.14    dimethylnitrosamine (DMN)                             1
2.15    N-ethyl-N-hydroxyethylnitrosamine (EHEN)              1
2.16    N-ethyl-N-nitrosourea (ENU)                           1
2.20    hydrazobenzene                                        1
2.22    lasiocarpine                                          1
2.26    3'-methyl-4-dimethylaminoazobenzene (3'-Me-DAB)       1
2.27    3-amino-9-ethylcarbazole                              1
2.28    N-nitrosooxazolidine                                  1
2.29    N-nitrosodi-n-propylamine (NDPA)                      1
2.30    N-nitrosomorpholine                                   1
2.31    N-nitrosopiperidine                                   1
2.32    N-nitrosopyrrolidine                                  1
2.33    quinoline                                             1
2.34    sterigmatocystin                                      1
2.35    4,4'-thiodianiline                                    1
2.42    alachlor                                              0
2.43    aldrin                                                0
2.44    auramine O                                            0
2.45    barbital                                              0
2.46    chlordane                                             0
2.47    chlorendic acid                                       0
2.48    chlorobenzilate                                       0
2.49    DDT                                                   0
2.50    dieldrin                                              0
2.51    diethylstilbestrol                                    0
2.53    ethenzamide                                           0
2.54    17α-ethinyl estradiol                                 0
2.55    DL-ethionine                                          0
2.56    hexachlorobenzene (HCB)                               0
2.57    α-hexachlorocyclohexane (α-HCH)                       0
2.58    d-limonene                                            0
2.59    monocrotaline                                         0
2.60    N-nitrosodiethanolamine                               0
2.61    phenobarbital                                         0
2.64    safrole                                               0
2.66    thioacetamide                                         0
2.67    triadimefon                                           0
2.68    trifluralin                                           0
2.69    urethane                                              0
2.70    polychlorinated biphenyl (PCB)                        0
2.71    malathion                                             0
2.72    vinclozolin                                           0
3.1     acetophenetidine (phenacetin)                         1
3.2     azathioprine                                          1
3.3     N-butyl-N-(4-hydroxybutyl)nitrosamine (BBN)           1
3.4     chrysazin (danthron)                                  1
3.5     4,4'-diaminodiphenylmethane (DDPM)                    1
3.6     7,12-dimethylbenz[a]anthracene (DMBA)                 1
3.7     N-ethyl-N-(4-hydroxybutyl)nitrosamine (EHBN)          1
3.8     folpet                                                1
3.9     hydrogen peroxide                                     1
3.11    3-methylcholanthrene (3-MC)                           1
3.12    N-methyl-N'-nitro-N-nitrosoguanidine (MNNG)           1
3.13    N-methyl-N-nitrosourea (MNU)                          1
3.14    8-nitroquinoline                                      1
3.17    streptozotocin                                        1
3.18    o-toluidine                                           1
3.20    6-methylquinoline                                     1
3.21    8-methylquinoline                                     1
3.22    nitrofurantoin                                        1
3.23    6-nitroquinoline                                      1
3.24    quercetin                                             1
3.32    acetaldehyde                                          0
3.33    atrazine                                              0
3.34    di(2-ethylhexyl)adipate (DEHA)                        0
3.35    1,1-dimethylhydrazine                               0
3.39    trichloroacetic acid                                0
3.42    4-acetylaminofluorene (AAF)                         0
3.43    aspirin                                             0
3.44    butylated hydroxytoluene (BHT)                      0
3.45    caffeine                                            0
3.46    caprolactam                                         0
3.47    chenodeoxycholic acid                               0
3.49    cypermethrin                                        0
3.50    deltamethrin                                        0
3.51    diltiazem                                           0
3.52    dimethylsulfoxide (DMSO)                            0
3.53    diazinon                                            0
3.54    fenvalerate                                         0
3.55    glutathione                                         0
3.56    4-O-hexyl-2,3,6-trimethylhydroquinone (HTHQ)        0
3.58    lithocholic acid                                    0
3.59    d-mannitol                                          0
3.61    phenol                                              0
3.64    propyl gallate                                      0
3.65    propylparaben                                       0
3.66    pyrene                                              0
3.67    resorcinol                                          0
3.71    trimorphamide                                       0

^a The numbering scheme refers to the numbering in the presentation by Yamaguchi et al. [27],
where the numeral before the decimal place refers to the table in which the compound was
listed (see below) and the numerals after the decimal refer to the compound's location
within the table.

Table 1 - Association between inhibitory results in the medium-term liver bioassay (Ito
test) and reported mutagenicity and carcinogenicity.

Table 2 - Association between positive results in the medium-term liver bioassay (Ito
test) and reported mutagenicity and carcinogenicity.

Table 3 - Association between negative results in the medium-term liver bioassay (Ito
test) and reported mutagenicity and carcinogenicity.
Table 2. Symbols and brief definitions for 101 topological indices.

Symbol              Definition
$I_D^W$             Information index for the magnitudes of distances between all possible pairs of vertices of a graph
$\bar{I}_D^W$       Mean information index for the magnitude of distance
$W$                 Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
$I^D$               Degree complexity
$H^V$               Graph vertex complexity
$H^D$               Graph distance complexity
$\overline{IC}$     Information content of the distance matrix partitioned by frequency of occurrences of distance h
$I_{ORB}$           Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
$O$                 Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph
$M_1$               A Zagreb group parameter = sum of square of degree over all vertices
$M_2$               A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
$IC_r$              Mean information content or complexity of a graph based on the r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$             Structural information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$             Complementary information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
${}^h\chi$          Path connectivity index of order h = 0-6
${}^h\chi_C$        Cluster connectivity index of order h = 3-6
${}^h\chi_{Ch}$     Chain connectivity index of order h = 3-6
${}^h\chi_{PC}$     Path-cluster connectivity index of order h = 4-6
${}^h\chi^b$        Bond path connectivity index of order h = 0-6
${}^h\chi^b_C$      Bond cluster connectivity index of order h = 3-6
${}^h\chi^b_{Ch}$   Bond chain connectivity index of order h = 3-6
${}^h\chi^b_{PC}$   Bond path-cluster connectivity index of order h = 4-6
${}^h\chi^v$        Valence path connectivity index of order h = 0-6
${}^h\chi^v_C$      Valence cluster connectivity index of order h = 3-6
${}^h\chi^v_{Ch}$   Valence chain connectivity index of order h = 3-6
${}^h\chi^v_{PC}$   Valence path-cluster connectivity index of order h = 4-6
$P_h$               Number of paths of length h = 0-10
$J$                 Balaban's J index based on distance
$J^B$               Balaban's J index based on bond types
$J^X$               Balaban's J index based on relative electronegativities
$J^Y$               Balaban's J index based on relative covalent radii
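
Three of the simpler definitions in the table can be made concrete with a short sketch. It computes the Wiener index W and the two Zagreb parameters M1 and M2 directly from the definitions given above, for a connected, hydrogen-suppressed graph supplied as an adjacency list. The function names and the example graph encoding are illustrative assumptions, not part of POLLY.

```python
from collections import deque

def distance_matrix(adj):
    """Topological distance matrix of a connected graph (adjacency lists)."""
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[source][v] = d
    return dist

def wiener_index(adj):
    """Half-sum of the off-diagonal elements of the distance matrix."""
    return sum(sum(row) for row in distance_matrix(adj)) // 2

def zagreb_indices(adj):
    """M1 = sum of squared vertex degrees; M2 = sum of degree products over edges."""
    deg = [len(neighbors) for neighbors in adj]
    m1 = sum(d * d for d in deg)
    m2 = sum(deg[u] * deg[v] for u in range(len(adj)) for v in adj[u] if u < v)
    return m1, m2

# Hydrogen-suppressed skeleton of thioacetamide: a central carbon (vertex 0)
# bonded to the methyl carbon, the nitrogen and the sulfur.
skeleton = [[1, 2, 3], [0], [0], [0]]
print(wiener_index(skeleton), zagreb_indices(skeleton))
```

For this four-vertex star the three center-to-leaf distances are 1 and the three leaf-to-leaf distances are 2, so W = 9, M1 = 12 and M2 = 9.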


Table 3. Eigenvalues, variance explained and two TIs most correlated with the eight
principal components.

PC     Eigenvalue   Percent variance   Cumulative variance   First most             Second most
                    explained          explained             correlated TI          correlated TI
PC1    55.52        54.97              54.97                 [illegible] (96.5%)    [illegible] (96.4%)
PC2    12.38        12.26              67.23                 SIC3 (86.4%)           SIC4 (85.5%)
PC3    11.73        11.61              78.84                 [illegible] (77.3%)    [illegible] (76.1%)
PC4    6.78         6.71               85.55                 IC0 (55.0%)            [illegible]
PC5    4.60         4.55               90.10                 J (68.9%)              [illegible] (62.4%)
PC6    2.35         2.32               92.43                 IC0 (-47.2%)           SIC0 (-36.4%)
PC7    1.65         1.63               94.06                 [illegible] (44.4%)    [illegible] (43.5%)
PC8    1.16         1.14               95.21                 [illegible] (-34.6%)   [illegible] (23.0%)
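
The variance columns of Table 3 can be checked with a few lines of arithmetic. The sketch assumes the principal components were extracted from the correlation matrix of the 101 indices, so that the eigenvalues sum to 101 and each percentage is the eigenvalue divided by 101. That assumption is not stated in the table itself, but it reproduces the tabulated percentages to within rounding.

```python
N_INDICES = 101  # number of topological indices listed in Table 2

# Eigenvalues of the eight retained principal components (Table 3).
eigenvalues = [55.52, 12.38, 11.73, 6.78, 4.60, 2.35, 1.65, 1.16]

# Percent of total variance explained by each component.
percent = [100.0 * ev / N_INDICES for ev in eigenvalues]

# Running total of the variance explained.
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

for i, (ev, p, c) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"PC{i}: eigenvalue {ev:6.2f}  percent {p:6.2f}  cumulative {c:6.2f}")
```

Small differences in the last digit (for example for PC6 and PC8) are expected, because the tabulated eigenvalues are themselves rounded to two decimals.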
Table 4. KNN results for the prediction of mutagenicity for 113 chemicals.

       Percent Negative     Percent Positive     Total Percent
       Correct              Correct              Correct
K      AP       ED          AP       ED          AP       ED
1      73.5     75.0        84.1     66.7        77.7     71.7
2      66.2     64.7        72.7     33.3        68.8     52.2
3      77.9     80.9        88.6     53.3        82.1     69.9
4      70.6     69.1        77.3     42.2        73.2     58.4
5      79.4     77.9        86.4     53.3        82.1     68.1
Figure Captions

Figure 1 - Unlabeled, hydrogen-suppressed graph of thioacetamide (G1).

Figure 2 - Labeled, hydrogen-filled graph of thioacetamide (G2) and sample calculations
for IC1, SIC1 and CIC1.

Figure 3 - Analogs selected for isospectral graph 10.1.1.
Thioacetamide

[Figure: G1, hydrogen-suppressed graph of thioacetamide; G2, labeled hydrogen-filled graph
of thioacetamide with vertices H1-H5, C6, C7, N8 and S9]

First-order neighborhood classes of G2:

Class:         I          II         III    IV     V      VI
Vertices:      (H1-H3)    (H4-H5)    C6     C7     N8     S9
Probability:   3/9        2/9        1/9    1/9    1/9    1/9

$$IC_1 = 4 \cdot \tfrac{1}{9}\log_2 9 + \tfrac{2}{9}\log_2 \tfrac{9}{2} + \tfrac{3}{9}\log_2 \tfrac{9}{3} = 2.419 \text{ bits}$$

$$SIC_1 = IC_1 / \log_2 9 = 0.763 \text{ bits}$$

$$CIC_1 = \log_2 9 - IC_1 = 0.751 \text{ bits}$$
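
The sample calculation can be reproduced with a short sketch. It takes the sizes of the equivalence classes of vertices (here the six first-order classes of the nine-vertex, hydrogen-filled graph of thioacetamide) and returns the three indices. The function name `information_indices` is a hypothetical helper, and the partition into classes is assumed to be given rather than derived from the graph.

```python
import math

def information_indices(class_sizes):
    """Return (IC, SIC, CIC) in bits for a partition of n vertices into classes."""
    n = sum(class_sizes)
    # Shannon information content of the partition.
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    sic = ic / math.log2(n)        # structural information content
    cic = math.log2(n) - ic        # complementary information content
    return ic, sic, cic

# Class sizes: three methyl H, two amino H, and one class each for C6, C7, N8, S9.
ic1, sic1, cic1 = information_indices([3, 2, 1, 1, 1, 1])
print(round(ic1, 3), round(sic1, 3), round(cic1, 3))
```

With these class sizes the rounded results are 2.419, 0.763 and 0.751, matching the figure.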
ABSTRACT

Five methods for characterizing intermolecular similarity have been used in the selection of analogs for a
diverse set of seventy-six compounds. These methods include an atom pair (AP) based similarity
measure, three principal component spaces derived from topostructural indices, topochemical indices, and
the combined set of all (topostructural and topochemical) indices, as well as one structure space
consisting of principal components calculated from physicochemical properties. Each method has been
used in the selection of sets of analogs, ranging from five to forty in number in increments of five, for
each of the seventy-six compounds. The degree of overlap of the sets of analogs selected by the five
separate methods was analyzed.


KEYWORDS 

molecular  graph,  atom  pairs,  principal  components,  analog  selection,  molecular  similarity 


INTRODUCTION

Molecular similarity is an intuitive concept which is subjectively understood by the chemist. In the realm
of mathematical and computational chemistry, intermolecular similarity can be objectively quantified in
terms of descriptors derived from the molecular structure (Basak et al, 1988b; Basak et al, 1997; Carbó et
al, 1980; Fisanick et al, 1992; Fisanick et al, 1994; Johnson et al, 1988; Maggiora and Johnson, 1990;
Randić, 1992; Willet and Winterman, 1986). Chemical structures can be represented by various types of
models, e.g., simple molecular graphs, multigraphs, pseudographs, 3-D models, and quantum chemical
Hamiltonian functions. Similarity, being context specific, is quantified in terms of a user-defined set of
parameters or properties of molecules. Consequently, there is a potentially endless number of methods
that one can define to quantify intermolecular similarity.

In recent years molecular similarity methods based on topological and substructural descriptors have
become popular. Such methods are based on different types of graph invariants such as topological
indices, atom pairs, and fragments (Basak and Grunwald, 1994, 1995c; Basak and Gute, 1997; Basak et
al, 1988b; Carbó et al, 1980; Carhart et al, 1985; Fisanick et al, 1992; Johnson et al, 1988; Randić, 1992;
Willet and Winterman, 1986). Similarity/dissimilarity methods have been used in the clustering of large
sets of chemicals (Lajiness, 1990), the selection of analogs for toxicological risk assessment (Basak and
Grunwald, 1994; Basak et al, 1995), and the estimation of the physicochemical and biomedicinal
properties of chemicals (Basak and Grunwald, 1995a, 1995c; Basak et al, 1996a; Basak and Gute, 1997).
Usually some number, n, of descriptors is used to define the structure space of chemicals and either
Euclidean distance in the n-dimensional space or some association coefficient is used to quantify
intermolecular similarity. The basic paradigm underlying molecular similarity analysis is "similar
structures have similar properties." However, it has been shown that different molecular similarity
methods select quite different sets of analogs from a specific database for the same set of query
chemicals (Basak and Grunwald, 1995c). In the case of the automated selection of analogs for testing
chemicals in drug design protocols or toxicological hazard assessment one would like to select analogs
by reasonably non-redundant molecular similarity methods. Therefore, it is of interest to investigate the
degree to which various similarity methods differ from each other. In a previous study we analyzed the
analog selection profiles for topologically-based vis-a-vis empirical property-based molecular similarity
techniques in the selection of nearest neighbors of molecules (Basak and Grunwald, 1995c). In this paper
we have compared the analog selection profiles of five different molecular similarity methods, four of
which are based on graph invariants and one of which is derived from physicochemical property data.


DATABASE  AND  PARAMETERS 

Development of the database

The data used in this study are a subset of the U.S. EPA ASTER system (Russom, 1992) which met the
following criteria. These compounds have experimental values within the ASTER database for:

1. Log P   Logarithm of the octanol/water partition coefficient (hydrophobicity).

2. BP      Boiling point at 760 Torr.

3. MP      Melting point.

Kamlet (1987) provided the remaining physicochemical properties used in this study. These four
solvatochromic parameters are:

1. V/100   The molar volume of a molecule calculated as its molecular weight divided by
           the liquid density at 20° C.

2. α       A measure of the hydrogen bond donor acidity of a compound in forming a
           hydrogen bond.

3. β       A scale of the hydrogen bond acceptor basicity of a compound in forming a
           hydrogen bond.

4. π*      A measure of solute or solvent dipolarity or polarizability that quantifies the
           ability of a compound to stabilize a neighboring charge or dipole by virtue of its
           dielectric effect.

Kamlet et al (1988) describe in detail the methods used in the determination of these solvatochromic
parameters.

Calculation  of  Atom  Pairs 

Atom pairs (APs) were calculated using the method of Carhart et al (1985). An atom pair is defined as a
substructure which consists of two non-hydrogen atoms i and j and their interatomic separation:

<descriptor_i>-<separation>-<descriptor_j>

where <descriptor> contains information about the element type, number of non-hydrogen neighbors,
and the number of π electrons for each atom. The interatomic separation of two atoms is the number of
atoms traversed in the shortest bond-by-bond path containing both atoms. These calculations were
conducted using the APProbe software developed by Basak and Grunwald (1993).
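
The definition above can be illustrated with a minimal sketch that enumerates atom pairs for a small hydrogen-suppressed structure. It is not the APProbe implementation: the descriptor string format (element, `X` plus the number of non-hydrogen neighbors, one dot per π electron) and the π-electron counts assigned in the example are assumptions made for illustration only. The separation is counted as the number of atoms on the shortest path, so bonded atoms have a separation of 2.

```python
from collections import Counter, deque

def atom_pairs(atoms, bonds):
    """Count atom pairs of a connected, hydrogen-suppressed molecular graph.

    atoms: list of (element, n_pi_electrons); bonds: list of (i, j) index pairs.
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    # Hypothetical descriptor encoding, e.g. "CX3." = carbon, 3 neighbors, 1 pi electron.
    desc = [f"{el}X{len(adj[i])}" + "." * pi for i, (el, pi) in enumerate(atoms)]
    pairs = Counter()
    for start in adj:
        bonds_away = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search for shortest paths
            u = queue.popleft()
            for v in adj[u]:
                if v not in bonds_away:
                    bonds_away[v] = bonds_away[u] + 1
                    queue.append(v)
        for end, d in bonds_away.items():
            if start < end:               # count each unordered pair once
                a, b = sorted((desc[start], desc[end]))
                pairs[f"{a}-{d + 1}-{b}"] += 1   # atoms on path = bonds + 1
    return pairs

# Thioacetamide skeleton: methyl C, thiocarbonyl C, N, S (pi counts are assumed).
atoms = [("C", 0), ("C", 1), ("N", 0), ("S", 1)]
bonds = [(0, 1), (1, 2), (1, 3)]
for pair, count in sorted(atom_pairs(atoms, bonds).items()):
    print(pair, count)
```

A molecule with n non-hydrogen atoms yields n(n-1)/2 atom pairs in total, which gives six for this four-atom skeleton.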

Calculation  of  Topological  Indices 

The topological indices used in this study have been calculated using the program POLLY 2.3 (Basak et
al, 1988a) and software developed by the authors to calculate Balaban's J indices. A complete listing of
these indices, along with examples of their calculation, has been given in detail previously (Basak and
Gute, 1997; Basak et al, 1997).

The  topological  indices  were  further  divided  into  two  subsets,  topostructural  and  topochemical  indices. 
Topostructural  indices  are  topological  indices  which  only  encode  information  about  the  adjacency  and 
distances  of  the  vertices  (atoms)  within  a  graph  (molecular  structure),  irrespective  of  the  chemical  nature 
of  the  atoms  involved.  The  topochemical  indices  are  parameters  which  quantify  information  regarding 
the  topology  of  the  graph  (molecule),  as  well  as  specific  chemical  properties  of  the  atoms  and  bonds 
comprising  the  molecule.  These  indices  are  derived  from  weighted  graphs  where  each  vertex  (atom)  or 
edge  (bond)  is  properly  weighted  with  selected  chemical  information.  The  division  of  the  topological 
indices  into  these  distinct  sets  has  been  discussed  in  previous  studies  (Basak  et  al,  1996b,  1997). 


Similarity  Measures 


Two measures of intermolecular similarity were used in this study. The methods have been described in
detail previously (Basak and Grunwald, 1995b) and include an associative measure using atom pairs
(AP) and Euclidean distance (ED) within an n-dimensional principal component (PC) space. The
Euclidean distance method was used in conjunction with the topological indices and the physicochemical
property data.
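
The two kinds of measure can be sketched as follows. The text defers the formulas to Basak and Grunwald (1995b), so the associative measure shown here is an assumption: it uses the form commonly attributed to Carhart et al (1985), S = 2C / (Ti + Tj), where C is the number of atom pairs common to both molecules and Ti, Tj are their total atom pair counts. The function names are hypothetical.

```python
import math
from collections import Counter

def ap_similarity(ap_i, ap_j):
    """Associative similarity 2C / (Ti + Tj) between two atom pair multisets."""
    common = sum((ap_i & ap_j).values())        # multiset intersection (min counts)
    total = sum(ap_i.values()) + sum(ap_j.values())
    return 2 * common / total if total else 0.0

def euclidean_distance(pc_i, pc_j):
    """Distance between two compounds in an n-dimensional PC space."""
    return math.dist(pc_i, pc_j)

# Made-up atom pair counts for two molecules.
mol_a = Counter({"CX1-2-CX3.": 2, "CX3.-2-NX1": 1})
mol_b = Counter({"CX1-2-CX3.": 1, "CX3.-2-SX1.": 2})
print(ap_similarity(mol_a, mol_b))
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
```

Note the difference in direction: the associative measure grows with similarity (1.0 for identical atom pair sets), whereas the Euclidean distance shrinks.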


ANALOG  SELECTION 

Following the quantification of intermolecular similarity for the five similarity spaces, the K-nearest
neighbors or analogs (K = 5, 10, 15, 20, 25, 30, 35, 40) were determined on the basis of the associative
measure used in conjunction with the AP method or based on ED within a principal component space.
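
A minimal sketch of this selection step, assuming each compound is represented by its coordinates in a principal component space and ranked by Euclidean distance. The function name and the toy coordinates are hypothetical; for a similarity measure such as the atom pair method, the `distance` argument could be replaced by a dissimilarity (for example one minus the similarity).

```python
import math

def nearest_analogs(query, database, k, distance=math.dist):
    """Return the indices of the k database compounds closest to the query compound."""
    candidates = (i for i in range(len(database)) if i != query)
    ranked = sorted(candidates, key=lambda i: distance(database[query], database[i]))
    return ranked[:k]

# Made-up coordinates of five compounds in a two-dimensional PC space.
pc_space = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 5)]
print(nearest_analogs(0, pc_space, 2))
```

Because Python's sort is stable, compounds at equal distance keep their database order, which is one simple (assumed) way to break ties.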


RESULTS  AND  DISCUSSION 

In generating the principal components for the sets of topological indices, only the principal components
with eigenvalues greater than 1.0 were retained. This left six PCs for the set of topostructural indices
which cumulatively explained 94.1% of the variance in the indices, eight PCs for the set of topochemical
indices which explained 93.5% of the variance in these indices, and ten PCs for the set of all topological
indices which cumulatively explained 95.2% of the variance in the topological indices. These formed the
final sets of PCs which were used in creation of the similarity spaces and selection of analogs for these
three methods.

Each similarity method was used to select sets of analogs for each of the seventy-six compounds in the
database. The analogs selected by each method were compared with the analogs selected by every other
method to examine the overlap between the sets of analogs. The results of this comparison are presented
in Table 1 below as the arithmetic mean of the cardinalities of the intersection of subsets of analogs
chosen by a particular pair of similarity methods for a specific value of K. For example, the
topostructural and topochemical similarity methods selected an average of 2.2 identical analogs out of
five for the entire set of seventy-six chemicals. Thus, slightly under half of the analogs selected by the
two methods were identical.
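
The overlap statistic described above reduces to a few lines. In this sketch each method supplies one list of selected analogs per query compound, in the same compound order; the function name and the toy analog lists are hypothetical.

```python
def mean_overlap(analogs_a, analogs_b):
    """Arithmetic mean of the intersection sizes of two methods' analog sets."""
    sizes = [len(set(a) & set(b)) for a, b in zip(analogs_a, analogs_b)]
    return sum(sizes) / len(sizes)

# Made-up analog selections (K = 3) by two methods for two query compounds.
method_s = [[1, 2, 3], [0, 2, 4]]
method_c = [[2, 3, 4], [1, 3, 5]]
print(mean_overlap(method_s, method_c))
```

Here the first query compound shares two analogs and the second shares none, so the mean overlap is 1.0 out of a possible 3.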

It is clear from the data in Table 1 that the five molecular similarity methods studied in this paper are not
radically different from one another because they have a substantial degree of overlap in the profile of
selected neighbors. This is an interesting observation in view of the fact that the structure spaces were
constructed from such diverse, independent variables as experimentally determined physicochemical
properties and calculated graph invariants.

A perusal of the data also shows that the property-based similarity method is distinct from the group of
methods based on topological indices and atom pairs. For K = 20, for example, the average numbers of
common neighbors for the property-based method vis-a-vis the topostructural, topochemical, all index
and atom pair-based methods are 8.7, 8.9, 8.6 and 8.9, respectively. For the same value of K, the numbers
of common analogs for the topostructural method with the atom pair, topochemical and all index methods
are 12.3, 12.2 and 13.1, respectively.


Table 1. Comparisons of the overlap in analog selection for five distinct similarity methods.

K     S vs C   S vs T   C vs T   S vs P   C vs P   T vs P   S vs A   C vs A   T vs A   P vs A
5     2.2      2.5      3.5      1.2      1.6      1.6      2.2      2.1      2.3      1.9
10    5.0      5.4      7.1      3.1      3.4      3.5      4.8      4.7      5.0      4.1
15    8.6      9.2      11.3     5.6      5.7      5.7      8.2      7.8      8.1      6.3
20    12.2     13.1     15.1     8.7      8.9      8.6      12.3     10.7     11.0     8.9
25    15.7     16.7     19.5     12.1     12.3     11.9     16.3     14.3     14.3     12.1
30    20.0     20.9     23.8     16.0     16.6     15.8     19.5     17.4     17.4     15.7
35    24.7     25.6     28.9     20.5     21.1     20.0     22.9     21.4     21.1     20.4
40    30.4     30.9     33.9     25.1     25.9     25.0     26.6     25.9     25.5     24.6

S = topostructural indices      P = physicochemical parameters
C = topochemical indices        A = atom pairs
T = all topological indices


For  the  three  similarity  methods  calculated  from  the  topological  indices,  the  topochemical  indices  seem 
to  have  more  influence  on  the  selection  of  neighbors  when  they  are  used  along  with  topostructural 
parameters as independent variables. This is clear from the fact that, for almost all values of K, the
topochemical and all index methods have a uniformly higher degree of overlap as compared to that
between the topostructural and all index methods.

In  conclusion,  if  one  is  interested  in  selecting  only  two  candidates  from  the  set  of  five  methods  studied 
here  for  analog  selection,  the  property-based  method  and  any  one  of  the  theoretically-based  methods 
would be the choice. There are no criteria to decide which of the four topologically-based methods should
be  selected  for  a  particular  occasion.  Further  studies  of  the  analog  selection  and  property  prediction 
profile  of  these  methods  are  necessary  to  guide  the  selection  of  a  specific  method  for  a  particular 
practical  situation. 
Keywords: Aromatic amines; hierarchical similarity; mutagenicity; quantum chemical descriptors;
topological indices


INTRODUCTION 

A current interest in the fields of chemistry, toxicology and biomedical
sciences is the prediction of the property/activity of chemicals from
calculated molecular descriptors [1-6]. In both environmental hazard
assessment and pharmaceutical drug design, one has to deal with thousands,
sometimes millions, of real or hypothetical chemical structures. Most of
these compounds have very little of the experimental data necessary for the
estimation of their toxicity or efficacy. In this age of combinatorial chemistry,
one can synthesize thousands of chemicals very quickly. However, experimental
testing of these large numbers of chemicals would not be cost effective.
Also, it is possible to create virtual libraries consisting of billions of structures.
In this case one would like to know the toxic, as well as therapeutic, potential
of such a vast collection of chemicals. The experimental data necessary for the
prediction of the toxicity/activity of these large and diverse sets of chemicals
will not be available to us in the near future.

This pervasive lack of experimental data demonstrates the need for the
development of predictive models based on parameters that can be calculated
directly from a chemical's molecular structure. Recently, our research
group has been involved in the development of a hierarchical approach
to quantitative structure-activity relationship (QSAR) model development
for predicting physicochemical, toxicological and pharmacological properties
of chemicals using theoretical molecular descriptors [3, 6-10]. Various
topological indices (TIs) fall in this category of molecular descriptors
[11-23]. Balaban has classified TIs into three generations based on whether
they are integers, real numbers or a sequence of numbers [24]. Different
classes of TIs quantify various aspects of molecular structure. We have shown
in the past that various indices, viz., connectivity indices and complexity
indices developed and used by Basak et al. [15-18], quantify distinctly
different types of molecular structural information. Such indices can be
calculated very rapidly. On the other hand, geometrical and quantum chemical
parameters encode information regarding the stereo-electronic aspects
of molecules. These classes of parameters are also algorithmically derived,
i.e., they can be calculated for any real or hypothetical molecular structure
without any input of experimental data.

One of our recent interests has been to test the relative effectiveness of the
four classes of theoretical molecular descriptors mentioned above in the
development of QSARs for predicting property/activity/toxicity of chemicals
[3, 6-10]. In this paper we have used these parameters in the development
of models for predicting mutagenicity/non-mutagenicity of a set of 127
aromatic amines.


METHODS 

Datasets 

A set of 127 aromatic and heteroaromatic amines, previously collected from
the literature by Debnath et al. [25], was used to study mutagenicity. The
mutagenicity of these compounds in S. typhimurium TA98 + S9 microsomal
preparation has been expressed as positive or negative mutagenicity by
Benigni [26]. Compounds included in this study and their mutagenic
classification based on experimentally determined mutagenic potency are
given in Table I. Of the compounds used in this study, 106 were classified as
mutagens while twenty-one were determined to be non-mutagens.


TABLE I Aromatic and heteroaromatic amines^a


Chemicals                                   TA98 (Expt.)   TA98 (Pred.)^b

2-Bromo-7-aminofluorene 

1 

\ 

2-Melhoxy-5‘methylaniline  (p-cresidme) 
5-Aminoquinoline 

4-Ethoxyaniline  {/?-pheneiidine) 

1 

1 

1 

\ 

1 

1 

1  -  Aminonaphthalene 

4-Aminofiuorcne 

1 

1 

2-Aminoanthraccne 

1 

1 

7- Aminofluoranthene 

8- Aminoquinoline 

1 ,7-Diaminophenazinc 

1 

1 

1 

\ 

1 

1 

2-Aminonaphthalene 

4-Aminopyrene 

?-AminO‘3'-niirobiphenyl 

2.4,5>Trimcihyianil»ne 

1 

1 

1 

! 

1 

1 

i 

1 

3-Aminof!uorene 

1 

3.3'*Dichlorobenzidme 

1 

1 

1 

2.4-Dimclhylaniline  (2,4-xylidine) 
2,7-Diaminofluorcne 

1 

1 

1 

1 

3-Aminofluoranthcne 

1 

I 

2-Aminofluorcne 

1 

1 

2-Amino-4'-nurobiphen>i 

4-Aminobiphcnyl 

1 

1 

1 

1 

0 

1 

I 

3-Mcihoxy-4-mcthylanihnc  (o-cresidinc) 
2-Aminocarbazolc 

1 

2'Ammo-5-niirophcno! 

1 

1 

1 

2.2'’Diaminobiphenyl 

1 

1 

1 

2*Hydroxy-7-aminoftuorcnc 

1  -Aminophcnanlhrcne 

2.5-Dimclhylaniljne  (2.5‘Xylidinc) 

1 

1 

1 

1 

1 

1 

1 

4'Ammo-2'-niirobiphcnyl 

2*Amino-4-mcthylphenol 

1 

1 

1 

I 

] 

2-Aminophcnazinc 

4-Aminophcnylsu!fide 

2.4-Dinitroanilme 

1 

I 

I 

1 

1 

1 

2,4*Diaminoisopropylbcnzene 

\ 

1 

1 

2.4.Difluoroaniiine 

1 

1 

4,4'-Mcthylcncdianiline 

1 

1 

3.3'-Dimcthylbcnzidmc 

2-Aminofluoranthcnc 

1 

1 

i 

1 

2-Amino-3'-nitrobiphcnyl 

1  -  Aminofluoranthcne 

I 

1 

1 

1 
TABLE I (Continued)

Chemicals                                   TA98 (Expt.)   TA98 (Pred.)^b

4,4'-Ethylenebis(aniline)                   1              1
4-Chloroaniline                             1              1
2-Aminophenanthrene                         1              1
4-Fluoroaniline                             1              1
9-Aminophenanthrene                         1              1
3,3'-Diaminobiphenyl                        1              1
2-Aminopyrene                               1              1
2,6-Dichloro-1,4-phenylenediamine           1              1
2-Amino-7-acetamidofluorene                 1              1
2,8-Diaminophenazine                        1              1
6-Aminoquinoline                            1              1
4-Methoxy-2-methylaniline (m-cresidine)     1              1
3-Amino-2'-nitrobiphenyl                    1              1
2,4'-Diaminobiphenyl                        1              1
1,6-Diaminophenazine                        1              1
4-Aminophenyldisulfide                      1              1
2-Bromo-4,6-dinitroaniline                  1              1
2,4-Diamino-n-butylbenzene                  1              0
4-Aminophenylether                          1              1
2-Aminobiphenyl                             1              1
1,9-Diaminophenazine                        1              1
1-Aminofluorene                             1              1
8-Aminofluoranthene                         1              1
2-Chloroaniline                             1              0
2-Amino-α,α,α-trifluorotoluene              1              1
2-Amino-1-nitronaphthalene                  1              1
3-Amino-4'-nitrobiphenyl                    1              1
4-Bromoaniline                              1              1
2-Amino-4-chlorophenol                      1              1
3,3'-Dimethoxybenzidine                     1              1
4-Cyclohexylaniline                         1              1
4-Phenoxyaniline                            1              1
4,4'-Methylenebis(o-ethylaniline)           1              0
2-Amino-7-nitrofluorene
Benzidine
1-Amino-4-nitronaphthalene
4-Amino-3'-nitrobiphenyl
4-Amino-4'-nitrobiphenyl
1-Aminophenazine
4,4'-Methylenebis(o-fluoroaniline)
4-Chloro-2-nitroaniline
3-Aminoquinoline
3-Aminocarbazole
4-Chloro-1,2-phenylenediamine
3-Aminophenanthrene
3,4'-Diaminobiphenyl
1-Aminoanthracene
1-Aminocarbazole
9-Aminoanthracene
4-Aminocarbazole
6-Aminochrysene
1-Aminopyrene
4,4'-Methylenebis(o-isopropylaniline)


0 
Chemicals                                   TA98 (Expt.)   TA98 (Pred.)^b

2,7-Diaminophenazine 

1 

0 

1 

1 

1 

4-Aininophenanlhrene 

1 

2,4-Diaminotoluene 

1 

3,3'-Diaminobenzidinc 

1 

0 

1 

1 ,3-Phenylenediamine 

1 

1 

3,4-Diaminololuene 

0 

1 

1,2-Phenylenediamine 

1 

1 

3'Amino-6-meihylphenol 

1 

1 

2,4-Diaminoelhylbenzene 

0 

4,4'-Melhylcnebis  (2,6-diisopropylanilme) 

u 

0 

4,4'-Melhyienebis  {2,6-diethylaniline) 

u 

0 

4.4'-Methylencb!s  (2-nicthyl-6-f-butylanilinc) 
4!4'-Methylenebis  (2-methyl-6-isopropylaniline) 

0 

0 

0 

0 

A 

0 

0 

1 

4,4'-Mcthylenebis  (2-nnethyi-6-cihylanilinc) 

4,4'-Methylenebis  (2,6-dimethylaniline) 

1 

3-Aminobiphenyl 

u 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

a 

1 

2.3'Diaminobiphenyl 

1 

2‘Methyl-4-chloroaniline 

1 

2-Chloro-4-melhylanilinc 

1 

4-Mcthoxyaniline 

1 

3-Methoxyanilinc 

0 

Anihne 

0 

1 

3-Chloroanilinc 

3-Elhoxyanilinc 

1 

2*Elhoxyaniline 

1 

4-Aminophenol 

0 

3-Aminophcnol 

u 

0 

0 

1 

0 

1 

2-Aminophenol 

2*Mclhoxyanilinc 

1 

4-Chloro-!.3-phcnylencdiaminc 

1 

1 

1 

2-Niiro- 1 ,4-phenylcncdiaminc 

1 

1 

1 

4-NtirO'  1 ,3-phenylcnediammc 

1 

1 

1 

4-Nilro- 1 .2-phenylcncdtamine 

1 

^a The table reports the mutagenicity of the aromatic and heteroaromatic amines as 0 = negative;
1 = positive.

^b TA98 results predicted using topostructural and topochemical indices.


Computation  of  Indices 

Topological indices used in this study have been calculated by POLLY 2.3
[27] which can calculate a total of 102 indices. These indices include the Wiener
index [28], connectivity indices [11, 12], information theoretic indices defined
on distance matrices of graphs [13, 14], a set of parameters derived on the
neighborhood complexity of vertices in hydrogen-filled molecular graphs
[15-18], as well as Balaban's J indices [19-21]. Table II provides brief
definitions for the topological indices included in this study.
TABLE II Symbols, definitions and classifications of topological parameters

Topostructural

$I_D^W$             Information index for the magnitudes of distances between all possible pairs of vertices of a graph
$\bar{I}_D^W$       Mean information index for the magnitude of distance
$W$                 Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
$I^D$               Degree complexity
$H^V$               Graph vertex complexity
$H^D$               Graph distance complexity
$\overline{IC}$     Information content of the distance matrix partitioned by frequency of occurrences of distance h
$O$                 Order of neighborhood when IC reaches its maximum value for the hydrogen-filled graph
$M_1$               A Zagreb group parameter = sum of square of degree over all vertices
$M_2$               A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
${}^h\chi$          Path connectivity index of order h = 0-6
${}^h\chi_C$        Cluster connectivity index of order h = 3-6
${}^h\chi_{Ch}$     Chain connectivity index of order h = 3-6
${}^h\chi_{PC}$     Path-cluster connectivity index of order h = 4-6
$P_h$               Number of paths of length h = 0-10
$J$                 Balaban's J index based on distance

Topochemical

$I_{ORB}$           Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
$IC_r$              Mean information content or complexity of a graph based on the r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$             Structural information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$             Complementary information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
${}^h\chi^b$        Bond path connectivity index of order h = 0-6
${}^h\chi^b_C$      Bond cluster connectivity index of order h = 3-6
${}^h\chi^b_{Ch}$   Bond chain connectivity index of order h = 3-6
${}^h\chi^b_{PC}$   Bond path-cluster connectivity index of order h = 4-6
${}^h\chi^v$        Valence path connectivity index of order h = 0-6
${}^h\chi^v_C$      Valence cluster connectivity index of order h = 3-6
${}^h\chi^v_{Ch}$   Valence chain connectivity index of order h = 3-6
${}^h\chi^v_{PC}$   Valence path-cluster connectivity index of order h = 4-6
$J^B$               Balaban's J index based on bond types
$J^X$               Balaban's J index based on relative electronegativities
$J^Y$               Balaban's J index based on relative covalent radii


Values for log P and the quantum chemical parameters ε_HOMO and
ε_LUMO were taken from the work of Debnath et al. [25]. Octanol/water
partition coefficients (log P) were determined experimentally for a set of 67
aromatic and heteroaromatic amines and, when these values were determined
to be in agreement with values calculated using the CLOGP program (release
3.54), the remainder of the log P values were calculated using CLOGP [29].
The quantum chemical parameters provided by Debnath et al., ε_HOMO and
ε_LUMO, were calculated using the semi-empirical AM1 method of MOPAC 4.10
(Quantum Chemistry Program Exchange No. 455) [30].


Data  Reduction 

Initially, all TIs were transformed by the natural logarithm of the index plus
one. This was done because the scale of some indices may be several orders of
magnitude greater than that of other indices, and some indices may equal zero.
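
The scaling step can be sketched as below. This is an illustrative sketch only; the function name and the sample values are hypothetical and are not taken from the study.

```python
import math


def log_scale(values):
    """Apply ln(x + 1) to each raw index value.

    Adding one keeps the transform defined when an index equals zero.
    """
    return [math.log1p(v) for v in values]


# Hypothetical raw values spanning several orders of magnitude
raw = [0.0, 9.0, 999.0]
scaled = log_scale(raw)  # 0.0, ln(10), ln(1000)
```

After scaling, an index that is zero stays at zero, while values that differ by orders of magnitude are compressed onto a comparable range.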

The set of 95 TIs was partitioned into two distinct sets: 38 topostructural
indices and 57 topochemical indices. Topostructural indices are indices
which encode information about the adjacency and distances of atoms
(vertices) in molecular structures (graphs) irrespective of the chemical nature
of the atoms involved in the bonding or factors like hybridization states of
atoms and number of core/valence electrons in individual atoms.
Topochemical indices are parameters which quantify information regarding the
topology (connectivity of atoms) as well as specific chemical properties of
the atoms comprising a molecule. Topochemical indices are derived from
weighted molecular graphs where each vertex (atom) is properly weighted
with selected chemical/physical properties. The categorization of the 95 TIs
into these sets is shown in Table II.

To further reduce the number of independent variables to be used for model
construction, the sets of topostructural and topochemical indices were further
divided into subsets, or clusters, based on the correlation matrix using the SAS
procedure VARCLUS [31]. This variable clustering procedure divides the
set of indices into disjoint clusters such that each cluster is essentially
unidimensional. The index most correlated with each cluster, as well as any
indices which were poorly correlated with the cluster (r < 0.70), were selected
for model development. Variable clustering was performed independently for
both the topostructural and topochemical subsets.
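
The selection rule applied within each cluster can be sketched as follows. This is not SAS VARCLUS: the cluster membership is taken as given, and the cluster component is approximated by the mean of the standardized member indices, which is an assumption made only to keep the sketch short (VARCLUS uses a principal or centroid component). All names and data are hypothetical, and constant-valued indices would need to be removed first.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardize(x):
    """Center to zero mean and scale to unit sample standard deviation."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / (n - 1))
    return [(a - m) / sd for a in x]


def select_from_cluster(columns, threshold=0.70):
    """Return the index most correlated with the cluster component,
    followed by any members whose |r| with the component is below threshold.

    columns: dict of index name -> list of values, all from ONE cluster.
    """
    z = {name: standardize(col) for name, col in columns.items()}
    n = len(next(iter(z.values())))
    # Stand-in for the cluster component: mean of the standardized members
    component = [sum(z[name][i] for name in z) / len(z) for i in range(n)]
    r = {name: pearson(z[name], component) for name in z}
    best = max(r, key=lambda name: abs(r[name]))
    poorly = [name for name in r if name != best and abs(r[name]) < threshold]
    return [best] + poorly


# Hypothetical cluster: a and b are nearly collinear, c is weakly related
demo = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
chosen = select_from_cluster(demo)
```

In this toy cluster one of the two collinear indices is kept as the representative, and the weakly correlated index is retained as well, mirroring the r < 0.70 rule described above.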


Statistical  Analysis  and  Hierarchical  DFA 

Selection of indices for the final models was conducted using all subsets
regression, in the SAS procedure REG [32], on the sets of indices chosen
through variable cluster analysis. This all subsets procedure was performed on
four distinct sets of indices: (1) the topostructural indices selected by variable
clustering, (2) the topostructural indices selected in all subsets regression and


the topochemical indices selected during variable clustering, (3) the
topostructural and topochemical indices selected in all subsets regression
and log P, and (4) the model chosen for topostructural and topochemical
indices with log P and with the addition of ε_HOMO and ε_LUMO. These sets of
indices were then used to develop and cross-validate discriminant function
models for classifying the mutagenicity/non-mutagenicity of the 127 aromatic
and heteroaromatic amines. Figure 1 illustrates the process for the selection of
indices and formulation of DFA models.


RESULTS  AND  DISCUSSION 

In the first step of our hierarchical modeling, the 38 topostructural parameters
were subjected to the variable clustering procedure. The following indices were
retained  from  the  five  clusters  generated:  I^i^,  IC,0,'^>;:c/xch/xPC. 

These  five  clusters  explained  a  total  variation  of  35.29  and  the  proportion  of 
the  variance  explained  was  equal  to  92.86%.  Of  the  57  topochemical  indices, 
the  following  ten  indices  were  selected  from  eight  clusters:  ICq,  IC2,  IC4, 
SlC2,SIC4,'^Xc’^A:ch^'*Xpc'"x‘”^^'  clusters  generated  from  the 

topochemical indices resulted in a total variation explained of 51.65 and the
proportion of the variance explained was equal to 90.61%. These indices
were then included in the all subsets regression procedure for the selection of
final indices for discriminant function analysis. In all cases, the RSQUARE
and ADJRSQ values were examined as indicators of model fit; however, the
final models were selected based on Mallows' Cp statistic (CP). Statistics
for the cluster analysis and the inter-correlation of the clusters for the
topostructural indices are presented in Tables III and IV, respectively. Similar
statistics for the variable clustering of the topochemical indices can be found
in Tables V and VI.
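
For reference, Mallows' Cp for a candidate subset can be computed as sketched below, using the usual definition Cp = SSE_p / s^2 - (n - 2p), where s^2 is the mean squared error of the full model and p counts the fitted parameters including the intercept. The function name and the numbers in the example are hypothetical and are not results from this study.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows' Cp for a subset model.

    sse_subset: error sum of squares of the subset model
    mse_full:   mean squared error of the full model (estimate of sigma^2)
    n_obs:      number of observations
    n_params:   parameters in the subset model, intercept included
    """
    return sse_subset / mse_full - n_obs + 2 * n_params


# Hypothetical values: a subset with little bias has Cp close to n_params
cp = mallows_cp(sse_subset=12.0, mse_full=0.5, n_obs=20, n_params=4)
```

Subsets whose Cp is small and close to the number of parameters are usually preferred, which is the sense in which Cp guided the final choice here.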

The  all  subsets  regression  of  the  eight  topostructural  indices  resulted  in 
the  selection  of  the  following  indices  for  model  development:  I^\  IC,  Pv 
These  indices  were  used  to  create  the  topostructural  DFA  model,  the 
simplest  model  in  the  hierarchy,  and  were  also  combined  with  the  ten 
topochemical  indices  to  create  the  second  model  in  the  hierarchy.  All  subsets 
regression  of  the  thirteen  topostructural  and  topochemical  indices  resulted 
in  the  selection  of  the  following  indices  for  modeling:  I^\  IC,  P3,  IQ,,  SIC2- 
These indices were combined with log P, resulting in a six parameter
model with log P added to the complete set of descriptors from the second
model. Finally, the quantum chemical descriptors, ε_HOMO and ε_LUMO, were
combined with the set of six indices and all subsets regression was used again



[Figure 1: flow chart. Topostructural descriptors (38 variables) and topochemical descriptors (57 variables) each undergo cluster analysis, giving 5 clusters (8 variables) and 8 clusters (10 variables), respectively. All subsets regression and DFA then produce the 3-, 5-, 6- and 5-variable DFA models.]

FIGURE 1 Illustration of the hierarchical method of index selection and discriminant
function analysis.


to select the best parameters for model construction. This procedure resulted
in  the  selection  of  the  following  model;  Ip.IC,  Pj.logP, €lumo- 

Discriminant  function  analysis,  using  the  SAS  procedure  DISCRIM  [33], 
was  used  to  develop  models  for  predicting  mutagenicity/non-mutagenicity 
TABLE III Statistics for the variable cluster analysis of the topostructural indices

Cluster  Members  Variation  Proportion  Second      Index most  Correlation
                  explained  explained   eigenvalue  correlated
1        18       16.99      0.94        0.71        Py          0.9918
2        2        2.00       1.00        0.00                    0.9992
3        3        2.15       0.71        0.72        ^XCh        0.9104
4        12       11.41      0.95        0.45        /S'         0.9977
5        3        2.73       0.91        0.18        '‘XPC       0.9474

TABLE IV Intercorrelation of the clusters generated in the variable cluster analysis of the
topostructural indices

Cluster  1        2        3        4        5
1        1.0000
2        0.0735   1.0000
3        0.6317   -0.0707  1.0000
4        0.9327   0.1389   0.3922   1.0000
5        0.7131   0.4006   0.2275   0.7793   1.0000

TABLE V Statistics for the variable cluster analysis of the topochemical indices

Cluster  Members  Variation  Proportion  Second      Index most  Correlation
                  explained  explained   eigenvalue  correlated
1        19       17.61      0.93        0.58        •x’         0.9686
2        8        7.52       0.94        0.42        SIC4        0.9876
3        4        3.76       0.94        0.24                    0.9484
4        6        5.11       0.85        0.80        y"          0.8889
5        5        4.72       0.94        0.23        IC4         0.9880
6        4        3.72       0.93        0.27        ^Xch        0.9419
7        6        4.68       0.78        0.79        SIC2        0.9079
8        5        4.52       0.90        0.21        Xpc         0.9225

TABLE VI Intercorrelation of the clusters generated in the variable cluster analysis of the
topochemical indices

Cluster  1        2        3        4        5        6        7        8
1        1.0000
2        -0.4121  1.0000
3        0.2311   -0.2150  1.0000
4        -0.8162  0.4459   -0.08X5  1.0000
5        0.3407   0.6649   -OOfvll  -0.2594  1.0000
6        047.t9   0.2192   -0.0509  -0.4812  0.5033   1.0000
7        -0.5604  0.4636   -0.1072  0.7565   -0.0130  -0.2089  1.0000
8        0.7805   -0.5046  0.5542   -0.4287  0.0484   0.1481   -0.2913  1.0000

TABLE VII Results of the cross-validated discriminant function analyses


Hierarchical classes

Topostructural
Topostructural + Topochemical
Topological + log P
Topological + log P + Quantum chemical


iSMc.p, 
I^IC.R,. 
IC0.SIC2 
I“,TC.  />,. 
lCo.SIC:,log/’ 
lX:,iC.  Py. 
log  P,  €lumo 


%  Correct 
{non-mutagens) 


%  Correct 
(mutagens) 


of chemicals in the Ames test. Four distinct models were developed using the
indices selected from the all subsets regression procedure as described above.
The results in Table VII show that all four models correctly classified
mutagens 93% to 95% of the time, whereas they were less effective
in predicting non-mutagenicity (29% to 43%).
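
The two percentages reported per model are class-wise accuracies, which can be computed from observed and cross-validated predicted classes as sketched below. This is an illustrative sketch only; the coding 1 = mutagen is an assumption (the table footnote states only 0 = negative), and the labels in the example are hypothetical, not data from Table VII.

```python
def percent_correct_by_class(observed, predicted):
    """Percentage of compounds in each observed class that were predicted correctly."""
    result = {}
    for label in sorted(set(observed)):
        idx = [i for i, obs in enumerate(observed) if obs == label]
        hits = sum(1 for i in idx if predicted[i] == label)
        result[label] = 100.0 * hits / len(idx)
    return result


# Hypothetical labels: 1 = mutagen (assumed coding), 0 = non-mutagen
observed = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
rates = percent_correct_by_class(observed, predicted)
```

Reporting the two classes separately matters here because a model can classify most mutagens correctly while still misclassifying the majority of non-mutagens.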

The addition of topochemical indices to the set of topostructural indices
resulted in the best predictive model, as shown in Table VII. It is clear from the
results that the addition of topochemical indices to the set of topostructural
indices slightly decreased the correct prediction of mutagenicity. However, there
was a significant improvement in the prediction of non-mutagenicity when
topochemical indices were added to the set of independent variables.

Finally, the addition of log P and quantum chemical indices did not
improve the models. This is in line with our earlier work with
physical and biochemical properties, which showed that topostructural and
topochemical indices explain most of the variance in the data [3, 6-10].
A hierarchical approach to quantitative structure-activity relationship (QSAR) modeling has been used to
estimate the complement-inhibitory potency of 105 benzamidines. This hierarchical approach uses
topostructural, topochemical, and geometric parameters in a stepwise fashion to build increasingly more
complex models. The results show that topostructural indices alone, specifically P, predict inhibitory potency
reasonably well. The addition of topochemical and geometrical parameters to the set of descriptors provides
only marginal improvement in predictive power. However, when taken alone, the geometric parameter
provides a more stable model than the topostructural one.


1.  INTRODUCTION 


A recent trend in structure-activity relationships (SAR)
is the use of topological and geometric parameters in
predicting the physicochemical, biochemical, and toxicological
properties of molecules. Topological indices (TIs)
are numerical descriptors of molecular topology and encode
information regarding the size, shape, branching, and symmetry
of molecular graphs. TIs and substructural parameters
have been very useful in the development of quantitative
structure-activity relationship (QSAR) models, in the
quantification of the structural similarity of chemicals, and in the
similarity-based estimation of numerous physical and biological
properties of diverse sets of molecules. On the
other hand, geometric variables such as total surface area,
volume, and three-dimensional Wiener index have been
employed in QSARs pertaining to biomedicinal and toxicological
action of molecules with good results.
One interesting area of research in biochemistry, pharmacology,
and toxicology is the rationalization of the action of
classes of chemicals with specialized modes of action.


Specificity in enzymology, immunology, and toxicology
arises out of specific structural features which lead to
particular types of interactions between ligands and their
biotargets. Topological and geometric parameters have been
used in the development of QSARs of many groups of
molecules with specific modes of action.
Complement is a system of factors occurring in normal
serum which are characteristically activated by antibody-antigen
interactions and which subsequently mediate a
number of biologically significant consequences. The
factors of the complement system include at least 20
chemically distinct serum proteins and glycoproteins. These
Table 1. Conflicting Data for Structure and Log 1/C for Four Benzamidines

no.   X                                        obsd log 1/C
77    3-O(CH2)3OC6H4-3-NHCONHC6H4-3^a-SO2F     4.23
95    3-O(CH2)3OC6H4-3-NHCONHC6H4-3-SO2F       4.51
97    3-O(CH2)3OC6H4-3-NHCOC6H4-4-SO2F         4.57
108   3-O(CH2)3OC6H4-3-NHCOC6H4-4-SO2F         5.21

^a This SO2F group should be meta- instead of para-.

Figure  1.  Neutral  base  structure  for  the  107  benzamidines. 

factors, which normally exist in an inactive form, are
activated by "classical" and "alternative" pathways. Both
pathways generate macromolecular membrane attack complexes
which lyse a variety of cells, bacteria, and viruses.
Products of this activation result in inflammatory reactions
at the site of antibody-antigen interaction. This is especially
pronounced in the case of organ-specific and systemic
autoimmune disorders. Therefore, control of unregulated
complement activation is important, at least in the case of
autoimmune disease.

Hansch and Yoshimoto carried out a QSAR study of a
set of 108 benzamidine derivatives using linear free-energy
related (LFER) parameters. These compounds are
inhibitors of the complement system. Because
LFER parameters are not routinely available for any
arbitrary chemical, real or hypothetical, it was of interest to
see whether computable parameters such as TIs and geometric
indices can give a reasonable QSAR for the set of
benzamidines. Therefore, in this paper we have carried out
a comparative study of the utility of topological indices
vis-a-vis calculated geometric parameters in predicting the
complement-inhibitory potencies of this set of benzamidines.
Table 2. Side-Chain Structures and Biological Property Data for 107 Benzamidines

Columns for each entry (left half: compounds 1-54; right half: compounds 55-107):
no.; X; 1/log C obsd; 1/log C predicted (a); resid

1 

3.5.(0CH02 

-0.452 

-0.367'’ 

-0.085 

55 

3-0(CH2)jOCfcH4-2-NHCOC6Hr2-Cl-6-SO;F 

-0.255 

-0.237 

-0.018 

2 

2-CH, 

-0.444 

-0.405 

-0.040 

56 

3-0(CH2)j0C*,H4-2>NHC0NHQHs 

-0.255 

-0.249 

-0.006 

3 

3.4-(CH3): 

-0.425 

-0.389 

-0.036 

57 

3-0(CH:)20C6H4-2-NHC0NHCfcHr2-Cl-5-S02F 

-0.250 

-0.236 

-0.014 

4 

H 

-0.4 18 

-0.417 

-0.002 

58 

3-0(CH2)30C«H4-2-NHC0NHCH2C«H4-4-S02F 

-0.250 

-0.228 

-0.022 

5 

3-OH 

-0.415 

-0.402 

-0.012 

59 

3-0(CH2)30C6H4*2.NHC0NH-C::6H2-2,4-(CH3)2-5-S02F 

-0.248 

-0.229 

-0.019 

6 

3-NHCO(CH:)3C6Hs 

-0.412 

-0.302" 

-0.1 10 

60 

3-0(CH2)30QH4-4-C00CH3 

-0.247 

-0.271 

0.025 

7 

3-CF3 

-0.410 

-0.369 

-0.041 

61 

3-0(CH2)30QH3-3-N02-4.CH3 

-0.245 

-0.273 

0.028 

8 

3-NO. 

-0.410 

-0.378 

-0.032 

62 

3-0(CH2)30QH4-3-CF3 

-0.245 

-0.273 

0.028 

9 

3-Br 

-0.405 

-0.401 

-0.004 

63 

3-0(CH2)30C6H4-2-NHCONHC6H4-4-CH3-3.S02F 

-0.245 

-0.229 

-0.015 

10 

3-CH3 

-0.398 

-0.402 

0.004 

64 

3-0(CH2)30C6H4‘4.NHC0C«H5 

-0.244 

-0.246 

0.002 

1 1 

3-OCH, 

-0.397 

-0.389 

-0.008 

65 

3-0(CH2)30C6H4-2-NHC0CH20C6H4-t-S02F 

-0.244 

-0.227 

-0.017 

12 

-0.373 

-0.339 

-0.034 

66 

3-0(CH2)j0C6H4*4-NHC0C6H4-i-CX:H3 

-0.243 

-0.236 

-0.007 

13 

3.5-(CH,)3 

-0.361 

-0.389 

0.028 

67 

3-0(CH2)30QH4-2-NHC0C6H4-3-S02F 

-0.243 

-0.238 

-0.005 

14 

3-0C,H7 

-0.355 

-0.362 

0.007 

68 

3-0(CH2)30QH4-2.NHC0CH2C6H4-4-S02F 

-0.243 

-0.233 

-0.010 

1.5 

3-i-C,Hm 

-0.355 

-0.353 

-0.002 

69 

3-0(CH2)30C6H4-3-C00CH3 

-0.242 

-0.272 

0.030 

16 

-0.351 

-0.349 

-0.001 

70 

3-0(CH:)j0C6H4>2-NHC0(CH2)2C«H4-4-S02F 

-0.242 

-0.227 

-0.014 

17 

3-C4H<, 

-0.338 

-0.362 

0.024 

71 

3-0(CH2)30C6H4^-NHC0C6H4-4-N02 

-0.239 

-0.232 

-0.007 

18 

3-CH«CHC6Hs 

-0.339 

-0.325 

-0.014 

72 

3-0(CH;)30QH4-2-NHC0C6H4-4.N02 

-0.239 

-0.241 

0.002 

19 

3-OCH7C6HS 

-0.331 

-0.326 

-0.005 

73 

3-0(CH2)30C6H4-4.NHC0NHC6H5 

-0.237 

-0.241 

0.004 

20 

-0.330 

-0.326 

-0.004 

74 

3-0(CH2)30C6H4-4-NHC0C6H4’3-N02 

-0.237 

-0.233 

-0.005 

21 

3-0C,H,) 

-0.329 

-0,327 

-0.002 

75 

3-0(CH2)30C6H4-2-NHC0(CH2)4C6H4-4-S02F 

-0.237 

-0.217 

-0.020 

22 

3-0(CH:)40QH, 

-0.325 

-0.288 

-0.037 

76 

3-0(CH:)3CX:<iH,-2-NHC0NHC6H4^-S02F 

-0.237 

-0.233 

-0.004 

23 

y-0{CHzhOCi,H, 

-0.323 

-0.306 

-0.017 

77 

3-0(CH2)30C6H4-3-NHC0NHC6H4^‘S02F 

-0.236 

-0.225 

-0.0 11 

24 

3-C*H. 

-0.323 

-0.347 

0.025 

78 

3-0(CH2)30C6H,-2-NHC0NH(CH2)2C«H4-4-S02F 

-0.236 

-0.223 

-0.014 

25 

3-0{CH:)30Qa,-4-COOH 

-0.321 

-0.277 

-0.044 

79 

3.0(CH2)40C6H4-3.NHC0C6H4-4-S02F 

-0.236 

-0.223 

-0.013 

26 

3-OC,Hn 

-0.320 

-0.338 

0.017 

80 

3-0(CH;)30C6H4-2-NHC0NHC6H3-4-C1-3-S02F 

-0.235 

-0.229 

-0.006 

27 

3-0-»-C,H„ 

-0.318 

-0.341 

0.022 

81 

3-0(CH2)40C6H4-2-NHC0C:*H3^-CH3-3-S02F 

-0.235 

-0.229 

-0.006 

28 

3.0(CH:):OC,oH7-a 

-0.312 

-0.283 

-0.030 

82 

3-0(CH2)3CX:6H4'2-NHC0C:*H2-2.4-(CH3)2-5-S02F 

-0.234 

-0.233 

-0.001 

29 

3-0(CH:)40C6H4-4-NH. 

-0.306 

-0.282 

-0.024 

83 

3-0(CH2)3OQH4-2-NHC0C6H2'2,4.Cl2*5-S02F 

-0.234 

-0.233 

-0.001 

30 

3-(CH:)4C*H. 

-0.302 

-0  306 

0.004 

84 

3-(CH2)4C6H4-2-NHC0NHC6H4-3-S02F 

-0.234 

-0.239 

0.005 

31 

3-0<CH.h0C*H4-3-N0: 

-0.301 

-0.277 

-0.024 

85 

3-0(CH2)30C6H4-3-NHC0C6H4-4.0CH3 

-0.233 

-0.237 

0.004 

32 

3-0<CH:),0C6H4-1-NH: 

-0.300 

-0.290 

-0.010 

86 

3-(CH2)4C6H4-2-NHC0NHC6H4-4.S02F 

-0.233 

-0.239 

0.007 

33 

3-(CH2);-4.C^H4N 

-0.299 

-0.326 

0.026 

87 

3-0(CH2)30C6H4-1-NHC0C6H4-4-C1 

-0.232 

-0.24 1 

0.009 

.34 

3-0iCH:)30C«H. 

-0.299 

-0.297 

-0.003 

88 

3-0(CH2)30C6H4-2.NHC0C6H,-2-CH,-5-S02F 

-0.232 

-0.236 

0.004 

35 

3-OiCHj),aH. 

-0.296 

-0.306 

0  010 

89 

3-0(CH2)40C6H4-4.NHC0NHC6H3-2.0CH,-5-S02F 

-0.232 

-0.214 

-0-018 

36 

.3-(CH:)r3-OH4N 

-0.294 

-0.326 

0.032 

90 

3-0(CH:)30C6H4-4-C6H5 

-0.230 

-0.261 

0.031 

37 

.3-<CH2)4C;H4-4.NHAc 

-0294 

-0.273 

-0.021 

91 

3-0(CH2)30C6H4-2-NHC0NHQH4-3-S02F 

-0.230 

-0.233 

0.003 

38 

3-(CH:h-2-aH.N 

-0.291 

-0.326 

0.035 

92 

3>0(CH2)30C*H40-NHC0C,H4-3.S02F 

-0.230 

-0.230 

-0.000 

39 

3-OiCH:),OaH4-2-NH: 

-0.283 

-0.29! 

0.009 

93 

3-0(CH2)20C6H4-3-NHC0C6H4*3-S0;F 

-0.229 

-0.236 

0.007 

40 

3-0(CH3)30C6H4-4-NHAc 

-0.278 

-0.265 

-0.012 

94 

3.0(CH2)30C6H4-4-CH3-3-NHC0C6H4^-S02F 

-0.229 

-0.226 

-0.003 

41 

3-(CH3)4-3-0R,N 

-0.276 

-0.306 

0.030 

95 

3-0(CH;)30C«H4‘3-NHC0NHC<,H4-3-S02F 

-0.222 

-0.226 

0.004 

42 

3-0(CH:)4C*H. 

-0.276 

-0.297 

0.020 

96 

3-0(CH2)30QH4-3-NHC0CH2C6H4^-S02F 

-0.220 

-0.226 

0.006 

43 

3-0<CH:),0C*H4-3-NHAc 

-0.270 

-0.267 

-0.003 

97 

3-0(CH2)30C6H4-3-NHC0CfcH4-4-S02F 

-0.219 

-0.229 

0.010 

44 

3-0(CHj>,OQH, -3.4-0; 

-0.265 

-0.283 

0.018 

98 

3-0(CH2)30QH4-2-NHC0NHC6H3-2-C1-5-S02F 

-0.217 

-0.230 

0.013 

45 

3-0(CHi).0C*K.-3-NH; 

-0.265 

-0.290 

0.025 

99 

3-0{CH2),0C*H4-3-NHC0CH20QH4*4-S02F 

-0.217 

-0.219 

0.002 

46 

3-0(CH2),OQH4-2-NHCOC6H4-4-SO;F 

-0.265 

-0.237 

-0.028 

100 

3-0(CH2)20C6H4-3-NHC0NHC<,H4-4-S02F 

-0.216 

-0.231 

0.015 

4^ 

3-0(CH;>,0C*H4-2-NHC0C:*H^ 

-0.265 

-0.253 

-0  012 

101 

3-0(CH;)40C*H4-3.NHC0NHC*H4^>S02F 

-0.215 

-0.220 

0.005 

4K 

3-0<CH;hOC:6H4^-OCH, 

-0.262 

-0.283 

0.022 

102 

3-0(CH2))0CfcH4-3-NHC0C6H4-4.N02 

-0.214 

-0.233 

0.019 

49 

3-0(CH;)40C4H4-4.NHCONHC^^-SO;F 

-0.260 

-0.219 

-0  040 

103 

3-0(CH2)20C6H4-3-NHC0C6H4-4-S02F 

-0.214 

-0.235 

0.021 

50 

3-CHCH;>30QH4-2-NHC0C^r2-0CH,-5-S0:F 

-0  260 

-0.233 

-0  027 

104 

3-0(CH2)40C,H4-2-NHC0NHC*H3-2-a-5-S02F 

-0.207 

-0.225 

0.018 

51 

3-0<CH;),0C*H4^-a 

-0.257 

-0.290 

0  033 

105 

3-0(CH2)30C6H4-3-NHC0NHC6H4-4.N02 

-0.204 

-0,230 

0.025 

52 

3-0(CH;>30C4i,-2-NO; 

-0.257 

-0.281 

0.024 

106 

3-0(CH2),0C6H4-4.CH3-3-NHC0NHC6H4^-S0;F 

-0.204 

-0.223 

0.018 

53 

3-0(CH;>,0C*H4-3-N0; 

-0.257 

-0.278 

0  021 

107 

3-0{CH;),0C6H4-3-NHC0NH(CH2):C6H4-4-S03F 

-0.193 

-0.215 

0.022 

54 

3-0(CH;),0QH4-3-0CH, 

-0.256 

-0.283 

0027 

a Predicted values based on eq 2. b Values for compounds
excluded from final modeling, provided to show lack of fit.

2.  METHODS 

2.1. Database. The 107 benzamidines used in this study
are those presented in the work of Hansch and Yoshimoto.
These data were compiled from a series of five articles by
B. R. Baker, in which Baker and his students determined
experimentally the inhibition of guinea pig complement by
benzamidines. Hansch and Yoshimoto provide the structures
and measured log 1/C values, where C is the micromolar
concentration for 50% inhibition of complement (I50), for
108 benzamidines. The numbered ordering used by Hansch
and Yoshimoto will be used in this manuscript as well for
ease of comparison. In the process of coding the data, it
became evident that two of the compounds had structural
duplicates with distinctly different values for log 1/C (see
Table 1). Through close examination of Baker's work, it
became evident that there was a typographic mistake in
compound 77, while the error in compound 108 could not
be accounted for. Thus, compound 108 was discarded from
the set, leaving 107 benzamidine derivatives. The base
structure of the benzamidines is presented in Figure 1, while
their side chains and biological activities, both measured and
estimated, are presented in Table 2.
Table 3. Symbols and Definitions of Topological and Geometrical Parameters

I_D^W     information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W     mean information index for the magnitude of distance
W         Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D       degree complexity
H^V       graph vertex complexity
H^D       graph distance complexity
ĪC        information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB     information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O         order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M_1       a Zagreb group parameter = sum of square of degree over all vertices
M_2       a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r      mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC_r     structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CIC_r     complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
hχ        path connectivity index of order h = 0-6
hχ_C      cluster connectivity index of order h = 3-6
hχ_PC     path-cluster connectivity index of order h = 4-6
hχ_Ch     chain connectivity index of order h = 6
hχ^b      bond path connectivity index of order h = 0-6
hχ^b_C    bond cluster connectivity index of order h = 3-6
hχ^b_Ch   bond chain connectivity index of order h = 6
hχ^b_PC   bond path-cluster connectivity index of order h = 4-6
hχ^v      valence path connectivity index of order h = 0-6
hχ^v_C    valence cluster connectivity index of order h = 3-6
hχ^v_Ch   valence chain connectivity index of order h = 6
hχ^v_PC   valence path-cluster connectivity index of order h = 4-6
P_h       number of paths of length h = 0-10
J         Balaban's J index based on distance
J^B       Balaban's J index based on bond types
J^X       Balaban's J index based on relative electronegativities
J^Y       Balaban's J index based on relative covalent radii
V_W       van der Waals volume
3DW       3-D Wiener number for the hydrogen-suppressed geometric distance matrix
3DW_H     3-D Wiener number for the hydrogen-filled geometric distance matrix


2.2. Calculation of Topological Indices (TIs). Topological
indices used in this study have been calculated by
POLLY 2.3. These indices include the Wiener index,
connectivity indices, information theoretic indices defined
on distance matrices of graphs, and a set of parameters
derived on the neighborhood complexity of vertices in
hydrogen-filled molecular graphs, as well as Balaban's
J indices. Table 3 gives brief definitions for the
topological indices included in this study.

2.3. Calculation of Geometrical Indices. Volume (V_W)
was calculated using the Sybyl package from Tripos
Associates, Inc. The 3-D Wiener numbers were calculated
using Sybyl with an SPL (Sybyl Programming Language)
program developed in our lab. Calculation of 3-D Wiener
numbers consists of summing the entries in the upper triangular
submatrix of the topographic Euclidean distance matrix for
a molecule. The 3-D coordinates for the atoms were
determined using CONCORD 3.0.1. Two variants of the
3-D Wiener number were calculated. For 3DW_H, hydrogen
atoms are included in the computations, and for 3DW,
hydrogen atoms are excluded from the computations.
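
The 3-D Wiener number described above can be sketched in a few lines. This is not the SPL program used in the study; the function name and the coordinates are hypothetical, and the choice between the two variants is made simply by which atoms are passed in.

```python
import math
from itertools import combinations


def wiener_3d(coords):
    """Sum of Euclidean distances over all unordered atom pairs.

    This equals the sum of the upper triangular entries of the
    geometric distance matrix. Pass heavy atoms only for the
    hydrogen-suppressed variant, or all atoms for the hydrogen-filled one.
    """
    return sum(math.dist(a, b) for a, b in combinations(coords, 2))


# Hypothetical coordinates (arbitrary units) forming a 3-4-5 right triangle
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
value = wiener_3d(points)  # 3 + 5 + 4
```

Because the distances are geometric rather than bond counts, two conformers of the same molecule generally give different values, unlike the topological Wiener index.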

2.4. Data Reduction. Initially, all TIs were transformed
by the natural logarithm of the index plus one. This was
done since the scale of some indices may be several orders
of magnitude greater than that of other indices. This scaling
was also done for the geometric indices for consistency.

The set of 92 TIs was divided into two distinct sets:
topostructural indices (TSI) and topochemical indices (TCI).

TSIs are topological indices which encode information about
the adjacency and distances of atoms (vertices) in molecular
structures (graphs) irrespective of the chemical nature of the
atoms involved in the bonding or factors such as hybridization
states of atoms, number of core/valence electrons in
individual atoms, etc. TCIs are parameters which quantify
information regarding the topology (connectivity of atoms)
as well as specific chemical properties of the atoms comprising
a molecule. TCIs are derived from weighted molecular
graphs where each vertex (atom) is properly weighted with
relevant chemical/physical properties. Table 4 shows the
breakdown of the topological indices into structural and
chemical indices.

The sets of TSIs and TCIs were further divided into
subsets, or clusters, based on the correlation matrix by using
the SAS procedure VARCLUS. The VARCLUS procedure
divides the set of indices into disjoint clusters so that each
cluster is essentially unidimensional.

From  each  cluster  we  selected  the  TI  most  correlated  with 
the  cluster  as  well  as  any  TIs  which  were  poorly  correlated 
with  the  cluster  (R  <  0.70).  These  TIs  were  then  used  in 
the  modeling  of  benzamidine-mediated  inhibition  of  guinea 
pig  complement.  The  variable  clustering  and  selection  of 
TIs  was  performed  independently  for  both  the  TSI  and  TCI 
sets  of  indices. 

2.5. Statistical Analysis. Regression modeling was
accomplished using the SAS procedure REG. During the
initial stages of statistical analysis it became apparent that it
Table 4. Classification of Parameters Used in Developing Models
for Complement Inhibition

Columns: topological; topochemical; geometric

/orb 

ICo  ~  IC6 

.3DW 

w 

SlCo  -  S1C6 

p 

ClCo  -  C1C6 

IT 

IP 

Vc  -  Yc 

IC 

Ycu 

0 

Yk  -  Yrc 

Ml 

Y  -  Y' 

M: 

"z'c  -  ‘z'c 

""x  - 

Ych 

^Xi  - 

Ypc  - 
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would  be  necessary  to  perform  an  alternative  transformation 
of  the  data.  Using  Hansch  and  Yoshimoto’s  Log  1/C 
transformation  resulted  in  residual  plots  that  showed  that  the 
variance  of  the  errors  correlated  with  the  predictions.  To 
deal  with  this  problem,  we  back  transformed  the  data  to  the 
initial  value  C  and  then  tried  several  other  transformations, 
finally  settling  on  1/Log  C  which  resulted  in  an  uncorrelated 
residual  plot.  All  subsets  linear  regression  was  then  carried 
out  on  three  distinct  sets  of  indices;  set  I—three  TSIs;  set 
II— the  TSI  used  in  model  I  and  four  TCIs;  and  set  Ill—the 
TSI  retained  in  model  I  and  the  three  geometrical  indices. 
The  regression  analysis  resulted  in  the  final  selection  of  TIs 
for  the  models. 
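
The change of response variable can be illustrated with a short sketch. It assumes base-10 logarithms, which is the usual convention for log 1/C but is not stated explicitly in the text; the function names are hypothetical.

```python
import math


def back_transform(log_inv_c):
    """Recover C from log(1/C), assuming base-10 logarithms."""
    return 10.0 ** (-log_inv_c)


def inverse_log(c):
    """Return 1/log10(C); undefined for C <= 0 and for C == 1 (log is zero)."""
    if c <= 0 or c == 1:
        raise ValueError("1/log10(C) is undefined for C <= 0 or C == 1")
    return 1.0 / math.log10(c)


def retransform(inv_log_c):
    """Map a value on the 1/log10(C) scale back to the log(1/C) scale."""
    return -1.0 / inv_log_c


# Round trip for an illustrative potency of log(1/C) = 4.
c = back_transform(4.0)        # C = 1e-4
y = inverse_log(c)             # 1/log10(C) = -0.25
restored = retransform(y)      # back to 4.0
```

A retransformation of this kind is needed whenever statistics from a model fitted on the 1/log C scale are compared with statistics reported on the log 1/C scale.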

3.  RESULTS 

Using only the topostructural class of indices, all-subsets
regression resulted in a one-parameter model to estimate I50:

1 /log  C= -1.1245  +  0.4989(1°)  (1) 

n = 105, r = 0.940, r_cv = 0.938, s = 0.0200, F = 785


This parameter was added to the set of topochemical
parameters. Again, all-subsets regression was used to
develop a model using this new set of independent variables.
The best model for estimation of I50 once again used only
P. This being the case, topochemical parameters were
dropped from the modeling procedure.

Using  all-subsets  regression  on  the  one  parameter  from 
eq  1  and  the  three  geometrical  parameters  resulted  in  the 
selection  of  a  different  one  parameter  model: 

1/log C = -0.6428 + 0.0490(3DW)  (2)

n = 105, r = 0.943, r_cv = 0.940, s = 0.0196, F = 824

Compounds  1  and  6  were  removed  from  all  models,  as 
they  were  both  strongly  influential  and  were  classified  as 
outliers  as  defined  by  the  studentized  range.  The  predicted 
values  from  eq  2  for  all  107  benzamidines,  including  the 
results  predicted  for  the  two  outliers,  are  presented  in  Table 
2. 

A  scatter  plot  of  the  experimental  data  for  the  107 
benzamidines  versus  the  values  predicted  using  eq  2  is 
presented  in  Figure  2.  Predicted  values  for  the  two  outliers 
have  been  included. 

4.  DISCUSSION 

The objective of this paper was to study the relative
effectiveness of topostructural, topochemical, and geometrical
parameters in estimating the complement inhibitory potency
of a set of benzamidines based solely on their chemical
structures. Theoretical structural indices can be derived from
distinct models of molecules. Also, various indices defined
on the same representation of the molecule can quantify
various aspects of molecular architecture. Recently, we have
advocated the use of a "hierarchical QSAR approach"
involving the TSI, TCI, geometrical, and quantum chemical
indices in the successful development of predictive
models.

In comparing our study to the work of Hansch and
Yoshimoto, it must be pointed out that our models did little
to improve on their QSAR analysis, as can be seen from
comparing our retransformed results with the results of their
best equation.

                        n      r       s
Basak et al.            105    0.943   0.264
Hansch and Yoshimoto    108    0.935   0.258


However, the LFER approach used by Hansch and Yoshimoto
required experimental data for all compounds in the study
and significant input from a human expert for the determination
of the three "structural" indicator variables. One
strength of our approach to this problem is the use of
nonempirical theoretical descriptors which can be calculated
solely from the chemical structure. With these purely
theoretical descriptors we have modeled the inhibition of
complement by benzamidines as successfully as Hansch and
Yoshimoto using their LFER approach.

It  is  clear  from  this  study  of  107  benzamidines  that  the 
TSI  indices  are  sufficient  to  explain  most  of  the  variance  in 
bioactivity.  The  addition  of  TCI  and  geometrical  parameters 
did  not  substantially  increase  the  predictive  power  of  the 
models.  However,  quantum  chemical  indices  were  not  used 
for  model  development  with  this  set  of  compounds. 

TSIs encode information about generalized size and shape
of a molecule. The success of TSI parameters in explaining
most of the complement-inhibitory action of these
benzamidines indicates that the general shape and size of these
molecules largely determines their bioactivity. In some of
our other studies we have found that the addition of quantum
chemical indices can improve the correlation in cases of
specific bioactivity. Further studies will focus on the
contribution of quantum chemical indices in explaining the
bioactivity of benzamidines.

Abstract

A contemporary trend in computational toxicology is
the prediction of toxicity endpoints and toxic modes of
action of chemicals from parameters that can be calculated
directly from their molecular structure. Topological,
geometrical, substructural, and quantum chemical
parameters fall into this category. We have been involved
in the development of a new hierarchical quantitative
structure-activity relationship (QSAR) approach
in predicting physicochemical, biomedicinal and
toxicological properties of various sets of chemicals. This
approach uses increasingly more complex molecular
descriptors for model building in a graduated manner. In
this paper we will apply statistical and neural net methods
in the development of QSAR models for predicting
toxicity of chemicals using topostructural, topochemical,
geometrical, and quantum chemical indices. The
utility and limitations of the approach will be discussed.


Introduction 

In 1998 the number of chemicals registered with the
Chemical Abstracts Service (CAS) rose to over 19 million
(CAS 1999). This is an increase of over 3 million
chemicals between 1996 and 1998. It would certainly
be desirable to be able to test each of these chemicals
for their effects on the environment and human health
(which we refer to as hazard assessment); however,
completing the battery of tests necessary for the proper
hazard assessment of even a single compound is a costly and
time-consuming process. Therefore, there is simply not
enough time or money to complete these test batteries
for even a tiny portion of the compounds which are
registered today (Menzel 1995). An alternative to these
traditional test batteries is to develop computational
models for hazard assessment. Computational models
are fast (milliseconds per compound), cheap (less than
one cent per compound), and do not run the risk of
adversely affecting the environment during testing. Thus
computational models can easily process all registered
chemicals and flag the ones that require further testing.
The central problem with this approach is developing
class-specific models that can be considered accurate
enough to be useful. In this paper, we present
computational models for hazard assessment that are indeed
considered both accurate and useful.

One of the fundamental principles of biochemistry is
that activity is dictated by structure (Hansch 1976).
Following this principle, one can use theoretical molecular
descriptors that quantify structural aspects of a
molecule to quantitatively determine its activity (Basak
& Grunwald 1995; Cramer, Famini, & Lowrey 1993).
These theoretical descriptors can be generated directly
from the known structure of the molecule and used
to estimate its properties, without the need for further
experimental data. This is important because, for
chemicals that need to be evaluated for hazard assessment,
there is a scarcity of the experimental data that are
normally required as inputs (i.e., independent variables)
to traditional quantitative structure-activity relationship
(QSAR) model development. A QSAR model based solely on
theoretical descriptors, on the other hand, can process all
registered chemicals for hazard assessment. Our recent studies
show that hierarchical QSARs (H-QSAR) using theoretical
structural descriptors give reasonable models for
predicting toxicity (Basak, Gute, & Grunwald In press;
Gute & Basak 1997; Gute, Grunwald, & Basak In
press).

One potential problem with using our hierarchical approach
is that it often gives many independent variables
as compared to data points. For instance, in our case
study of predicting acute toxicity (LC50) of benzene
derivatives, we have 95 independent variables and 69
data points. Therefore, reducing the number of independent
variables is critical when attempting to model
small data sets. The smaller the data set, the greater
the chance of spurious error when using a large number
of independent variables (descriptors). Part of our
focus in this paper is attempting to reduce the number of
descriptors in the data set.

Hierarchical  QSAR 

Our recent studies have focused on the role of different
classes of theoretical descriptors of increasing levels
of complexity and their utility in QSAR (Gute &
Basak 1997; Gute, Grunwald, & Basak In press). Four
distinct sets of theoretical descriptors have been used
in this study: topostructural, topochemical, geometric,
and quantum chemical indices. Gute and Basak 1997
provide the detailed list of the indices included in our
study.

Topological  Indices 

The complete set of topological indices used in this
study, both the topostructural and the topochemical,
has been calculated using POLLY 2.3 (Basak, Harriss,
& Magnuson 1988) and software developed by the authors.
These indices include the Wiener index (Wiener
1947), the connectivity indices developed by Randic
1975 and higher order connectivity indices formulated
by Kier and Hall 1986, bonding connectivity indices
defined by Basak and Magnuson 1988, a set of information
theoretic indices defined on the distance matrices
of simple molecular graphs (Hansch & Leo 1995),
neighborhood complexity indices of hydrogen-filled
molecular graphs, and Balaban's 1983 J indices.

Geometrical  Indices 

The geometrical indices are three-dimensional Wiener
numbers for hydrogen-filled molecular structure,
hydrogen-suppressed molecular structure, and van der
Waals volume. Van der Waals volume, V_W (Bondi
1964), was calculated using Sybyl 6.1 from Tripos
Associates, Inc. of St. Louis. The 3-D Wiener numbers
were calculated by Sybyl using an SPL (Sybyl Programming
Language) program developed in our lab (SYBYL
1998). Calculation of 3-D Wiener numbers consists of
summing the entries in the upper triangular submatrix of the
topographic Euclidean distance matrix for a molecule.
The 3-D coordinates for the atoms were determined
using CONCORD 3.0.1 from Tripos Associates, Inc.
Two variants of the 3-D Wiener number were calculated:
3DW_H and 3DW. For 3DW_H, hydrogen atoms
are included in the computations and for 3DW, hydrogen
atoms are excluded from the computations.
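
The definition above translates directly into code. The following is a minimal sketch and not the authors' SPL program: it assumes atoms are supplied as (element, (x, y, z)) tuples, and the function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def wiener_3d(coords):
    """Sum Euclidean distances over all unordered pairs of points.

    Each unordered pair corresponds to one entry of the upper triangular
    submatrix of the geometric distance matrix.
    """
    return sum(math.dist(a, b) for a, b in combinations(coords, 2))


def wiener_3d_variants(atoms):
    """Return (hydrogen-filled, hydrogen-suppressed) 3-D Wiener numbers."""
    all_coords = [xyz for _, xyz in atoms]
    heavy_coords = [xyz for element, xyz in atoms if element != "H"]
    return wiener_3d(all_coords), wiener_3d(heavy_coords)


# Toy geometry (not a real molecule): two carbons 5 units apart plus one hydrogen.
toy = [("C", (0.0, 0.0, 0.0)), ("C", (3.0, 4.0, 0.0)), ("H", (0.0, 0.0, 1.0))]
with_h, without_h = wiener_3d_variants(toy)
```

Iterating over unordered pairs avoids building the full distance matrix, which would store every distance twice.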

Quantum  Chemical  Parameters 

The following quantum chemical parameters were calculated
using the Austin Model version one (AM1)
semi-empirical Hamiltonian: energy of the highest occupied
molecular orbital (E_HOMO), energy of the second
highest occupied molecular orbital (E_HOMO1),
energy of the lowest unoccupied molecular orbital
(E_LUMO), energy of the second lowest unoccupied
molecular orbital (E_LUMO1), heat of formation (ΔH_f),
and dipole moment (μ). These parameters were calculated
using MOPAC 6.00 in the SYBYL interface (Stewart
1990).

Results 

We tested the utility of our approach of generating
numerous hierarchical theoretical descriptors of compounds
on the acute aquatic toxicity (LC50) of a congeneric
set of 69 benzene derivatives. The data were
taken from the work of Hall, Kier and Phipps 1984,
where acute aquatic toxicity was measured in fathead
minnow (Pimephales promelas). Their data were compiled
from eight other sources, as well as some original
work which was conducted at the U.S. Environmental
Protection Agency (USEPA) Environmental Research
Laboratory in Duluth, Minnesota. This set of chemicals
was composed of benzene and 68 substituted benzene
derivatives. According to the authors, these benzene
derivatives were tested using methodologies comparable
to their own 96-hour fathead minnow toxicity test system.
The derivatives chosen for this study have seven
different substituent groups that are present in at least
six of the molecules. These groups consist of chloro,
bromo, nitro, methyl, methoxyl, hydroxyl, and amino
substituents.

We studied two classes of approaches for modeling
toxicity: (1) giving all the descriptors to a learning
algorithm (neural networks in this case), and (2) reducing
the feature set before giving the (reduced) feature set
to a learning algorithm. Results for our approaches are
from leave-one-out experiments (i.e., 69 training/test
set partitions). Leave-one-out works by leaving one
data point out of the training set and giving the remaining
instances (68 in this case) to the learning algorithms
for training. (It is worth noting that each member of the
ensemble sees the same 68 training instances for each
training/test set partition and thus ensembles have no
unfair advantage over other learners.) This process is
repeated 69 times so that each example is a part of the
test set once and only once. Leave-one-out tests the
generalization accuracy of a learner, whereas training set
accuracy tests only the learner's ability to memorize.
Generalization error from the test set is the true test of
accuracy and is what we report here.
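
The leave-one-out procedure can be sketched generically. This is an illustration under stated assumptions rather than the code used in the study: the `fit` and `predict` callables are hypothetical placeholders, and a trivial mean predictor stands in for the neural network.

```python
def leave_one_out(xs, ys, fit, predict):
    """Return one held-out prediction per example.

    Each example is removed in turn, a model is fit on the remaining
    examples, and the removed example is predicted by that model.
    """
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, xs[i]))
    return predictions


def fit_mean(train_x, train_y):
    """Stand-in learner: remember the mean of the training targets."""
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    """Stand-in predictor: always return the stored mean."""
    return model


xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 2.0, 3.0]
held_out = leave_one_out(xs, ys, fit_mean, predict_mean)
```

With 69 compounds the loop would run 69 times, each time training on 68 instances, as described in the text.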

Table 1 gives our results. First we trained neural
networks using all 95 parameters. The networks contained
15 hidden units and we trained the networks for
1000 epochs. We normalized each input parameter to
a value between 0 and 1 before training. Additional
parameter settings for the neural networks included a
learning rate of 0.05, a momentum term of 0.1, and
weights initialized randomly between -0.25 and 0.25.
With these ninety-five parameters, the neural network
obtained a test-set correlation coefficient between predicted
toxicity and measured toxicity (explained variance)
of R^2 = 0.868 and a standard error of 0.29. Target
toxicity measurements ranged from 3.04 to 6.37.
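
The input scaling step can be sketched as follows. The text says only that each input parameter was normalized to a value between 0 and 1; min-max scaling is assumed here as the most common way to do that, and the handling of constant columns is a choice made for this sketch.

```python
def min_max_normalize(columns):
    """Scale each descriptor column to the range [0, 1].

    `columns` is a list of columns, one list of values per descriptor.
    """
    scaled = []
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        # A constant column carries no information; map it to 0.0.
        scaled.append([(v - low) / span if span else 0.0 for v in column])
    return scaled


# Two hypothetical descriptor columns measured on three compounds.
descriptors = [[2.0, 4.0, 3.0], [5.0, 5.0, 5.0]]
normalized = min_max_normalize(descriptors)
```

Scaling matters here because the descriptors differ by orders of magnitude, and unscaled inputs would let the largest ones dominate the weight updates.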

For our next experiments, the VARCLUS method of
SAS 1998 was used for selecting subsets of topostructural
and topochemical parameters for QSAR model
development. With this method, the set of topological
indices is first partitioned into two distinct sets,
the topostructural indices and the topochemical indices.
To further reduce the number of independent variables
for model construction, the sets of topostructural and
topochemical indices were further divided into subsets,

Numerous quantitative structure-activity relationships (QSARs) have been developed using topostructural,
topochemical, and geometrical molecular descriptors. However, few systematic studies have been carried
out on the relative effectiveness of these three classes of parameters in predicting properties. We have
carried out a systematic analysis of the relative utility of the three types of structure descriptors in developing
QSAR models for predicting vapor pressure at STP for a set of 476 diverse chemicals. The hierarchical
technique has proven to be useful in illuminating the relationships of different types of molecular description
information to physicochemical property and is a useful tool for limiting the number of independent variables
in linear regression modeling to avoid the problems of chance correlations.


1.  INTRODUCTION 

A large number of quantitative structure-activity relationship
(QSAR) studies have been reported in recent literature
using theoretical molecular descriptors in predicting
physicochemical, pharmacological, and toxicological properties
of molecules. Such descriptors comprise graph invariants,
geometrical or 3-D parameters, and quantum chemical
indices. One of the reasons for the current upsurge of interest
is the fact that such descriptors can be derived algorithmically,
i.e., can be computed for any molecule, real or
hypothetical, using standard software. Both in pharmaceutical
drug design and in risk assessment of chemicals, one
has to evaluate potential biological effects of chemicals.
Evaluation schemes based on property-property correlation
paradigms are not very useful in practical situations, because,
for most of the candidate structures, the experimental data
necessary for proper evaluation are not available. This is
especially true for the thousands of chemicals rapidly
produced by methods of combinatoric chemistry as well
as for the large number of chemicals present in the Toxic
Substances Control Act (TSCA) Inventory.

A large number of physicochemical and biological endpoints
are necessary for estimating the ecotoxicological fate,
transport, and effects of environmental pollutants. The
vapor pressure of chemicals is important in determining the
partitioning of chemicals among different phases once they
are released in the environment. Many QSARs have been
reported for predicting normal vapor pressure of chemicals.
Such studies are usually carried out on small sets of
congeneric chemicals. Also, many QSARs use experimental
data as inputs in the model. Therefore, it becomes necessary
to develop QSARs based on nonempirical parameters which
can predict the vapor pressure for a heterogeneous collection
of chemicals so that such models are generally applicable.
With this end in mind, in the current paper we have carried
out a QSAR study of 476 diverse chemicals using three types
of nonempirical molecular descriptors.

2. MATERIALS AND METHODS

2.1. Normal Vapor Pressure Database. Measured
values for a subset of the Toxic Substances Control Act
(TSCA) Inventory were obtained from the ASTER (Assessment
Tools for the Evaluation of Risk) database. This
subset consisted of a diverse set of chemicals where vapor
pressure (pvap) was measured at 25 °C and over a pressure
range of approximately 3-10 000 mmHg. Due to the size
of the dataset being used in this study, data for these
chemicals will not be listed in this paper. An electronic copy
of the data may be obtained by contacting the authors.

2.2. Computation of Topological Indices. The majority
of the topological indices (TIs) used in this study have been
calculated by the computer program POLLY 2.3. These
indices include the Wiener index, the molecular connectivity
indices developed by Randic and Kier and Hall, information
theoretic indices defined on distance matrices of
graphs, and a set of parameters derived on the neighborhood
complexity of vertices in hydrogen-filled molecular
graphs. Balaban's J indices were calculated using
software developed by the authors.

van der Waals volume (V_W) was calculated using
Sybyl 6.2. The 3-D Wiener numbers were calculated by
Sybyl using an SPL (Sybyl Programming Language) program
developed by the authors. Calculation of 3-D Wiener
numbers consists of the summation of the entries in the upper
triangular submatrix of the topographic Euclidean distance
matrix for a molecule. The 3-D coordinates for the atoms
were determined using CONCORD 3.2.1. Two variants
of the 3-D Wiener number were calculated, 3DW_H and 3DW,
where hydrogen atoms are included and excluded from the
computations, respectively.

Table 1 provides a complete listing of all of the topological
and geometrical parameters which have been used in this
study. The listing includes the symbols used to represent
the parameters and brief definitions for each of the parameters.

Two additional parameters were used in modeling normal
vapor pressure, HB1 and dipole moment (μ). HB1 is a
simple hydrogen bonding parameter calculated using a
program developed by Basak, which is based on the ideas
of Ou et al. Dipole moment was calculated using Sybyl
6.2.

Table 1. Symbols and Definitions of Topological and Geometrical
Parameters

I_D^W      information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W      mean information index for the magnitude of distance
W          Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D        degree complexity
H^V        graph vertex complexity
H^D        graph distance complexity
IC         information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB      information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O          order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M1         a Zagreb group parameter = sum of square of degree over all vertices
M2         a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r       mean information content or complexity of a graph based on the rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
SIC_r      structural information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
CIC_r      complementary information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
hχ         path connectivity index of order h = 0-6
hχ_C       cluster connectivity index of order h = 3-6
hχ_PC      path-cluster connectivity index of order h = 4-6
hχ_Ch      chain connectivity index of order h = 5, 6
hχ^b       bond path connectivity index of order h = 0-6
hχ^b_C     bond cluster connectivity index of order h = 3-6
hχ^b_Ch    bond chain connectivity index of order h = 5, 6
hχ^b_PC    bond path-cluster connectivity index of order h = 4-6
hχ^v       valence path connectivity index of order h = 0-6
hχ^v_C     valence cluster connectivity index of order h = 3-6
hχ^v_Ch    valence chain connectivity index of order h = 5, 6
hχ^v_PC    valence path-cluster connectivity index of order h = 4-6
P_h        number of paths of length h = 0-10
J          Balaban's J index based on distance
J^B        Balaban's J index based on bond types
J^X        Balaban's J index based on relative electronegativities
J^Y        Balaban's J index based on relative covalent radii
V_W        van der Waals volume
3DW        3-D Wiener number for the hydrogen-suppressed geometric distance matrix
3DW_H      3-D Wiener number for the hydrogen-filled geometric distance matrix

2.3. Data Reduction. The set of 92 TIs was partitioned
into two distinct subsets: topostructural indices and
topochemical indices. The distinction was made as follows:
topostructural indices encode information about the adjacency
and distances of atoms (vertices) in molecular structures
(graphs) irrespective of the chemical nature of the atoms
involved in the bonding or factors like hybridization states
of atoms and number of core/valence electrons in individual
atoms, while topochemical indices quantify information
regarding the topology (connectivity of atoms) as well as
specific chemical properties of the atoms comprising a
molecule. Topochemical indices are derived from weighted
molecular graphs where each vertex (atom) is properly
weighted with selected chemical/physical properties. These
subsets are shown in Table 2.

The partitioning of the indices left 38 topostructural indices
and 54 topochemical indices. At this point no further data
reduction is called for, since the ratio of the number of
observations in the training set (342) to the total number of
variables (92 maximum) falls well within the condition limits
suggested by Topliss and Edwards for reducing the
probability of spurious correlations even at the more
conservative R^2 > 0.7 level.

Table 2. Classification of Parameters Used in Modeling Normal
Vapor Pressure [log10(pvap)]

topological          topochemical             geometric    other parameters
I_D^W                I_ORB                    V_W          HB1
Ī_D^W                IC0-IC5                  3DW          μ
W                    SIC0-SIC5                3DW_H
I^D                  CIC0-CIC5
H^V                  0χ^b-6χ^b
H^D                  3χ^b_C-6χ^b_C
IC                   5χ^b_Ch and 6χ^b_Ch
O                    4χ^b_PC-6χ^b_PC
M1                   0χ^v-6χ^v
M2                   3χ^v_C-6χ^v_C
0χ-6χ                5χ^v_Ch and 6χ^v_Ch
3χ_C-6χ_C            4χ^v_PC-6χ^v_PC
5χ_Ch and 6χ_Ch      J^B
4χ_PC-6χ_PC          J^X
P0-P10               J^Y
J

2.4.  Statistical  Analysis  and  Hierarchical  QSAR. 
Initially,  all  TIs  were  transformed  by  the  natural  logarithm 
of  the  index  plus  one.  This  was  done  since  the  scale  of  some 
indices  may  be  several  orders  of  magnitude  greater  than  that 
of  other  indices.  The  geometric  parameters  were  transformed 
by  the  natural  logarithm  of  the  parameter. 
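
The two scaling transforms can be written down directly. This is a small sketch with hypothetical function names; the example values are illustrative and not taken from the dataset.

```python
import math


def transform_topological(value):
    """ln(index + 1), so that an index equal to zero maps to zero."""
    return math.log(value + 1.0)


def transform_geometric(value):
    """ln(parameter); requires a strictly positive parameter."""
    return math.log(value)


# Two illustrative index values three orders of magnitude apart.
small, large = 10.0, 10000.0
scaled_small = transform_topological(small)
scaled_large = transform_topological(large)
```

After the transform the two values differ by a factor of about four rather than a thousand, which is the scale compression the text describes. Adding one before taking the logarithm keeps indices that equal zero well defined.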

Two regression procedures were used in developing the
linear models. When the number of independent variables
was high, typically greater than 25, a stepwise regression
procedure was used to maximize the improvement of the
explained variance (R^2). When the number of independent
variables was smaller, all possible subsets regression was
used. Models were then optimized to reduce problems of
variance inflation and collinearity. Regression modeling was
conducted using the REG procedure of the statistical package
SAS.
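
The text does not state why roughly 25 variables was used as the cutoff; one plausible reason is the combinatorial cost of all possible subsets regression. The sketch below, with hypothetical function names, enumerates the candidate subsets and counts them.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every non-empty subset of the candidate variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


def subset_count(n_variables):
    """Number of non-empty subsets, i.e. candidate regression models."""
    return 2 ** n_variables - 1


small_example = list(all_subsets(["a", "b", "c"]))
models_at_25 = subset_count(25)
models_at_92 = subset_count(92)
```

At 25 variables there are already about 33.5 million candidate models, and at 92 variables exhaustive enumeration is out of reach, which is where a stepwise search becomes the practical choice.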

The  vapor  pressure  data  (pvap)  was  split  into  a  training  set 
(342  compounds)  and  a  test  set  (134  compounds),  an 
approximately  75/25  split.  Models  were  developed  using 
the  training  set  of  chemicals  and  then  used  to  predict  the 
Pvap  values  of  the  test  chemicals.  Final  models  were  then 
developed  using  the  combined  training  and  test  set  of 
chemicals. 
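
The split sizes can be reproduced with a short sketch. The text does not say how compounds were assigned to the two sets; a seeded random shuffle is assumed here purely for illustration, and the function name is hypothetical.

```python
import random


def split_train_test(items, n_test, seed=0):
    """Shuffle a copy of `items` and hold out `n_test` of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Compound identifiers 0..475 stand in for the 476 chemicals.
train, test = split_train_test(range(476), n_test=134)
train_fraction = len(train) / 476
```

The resulting fraction is about 0.72, which is the "approximately 75/25" split mentioned in the text.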

Five sets of indices were used in model development.
These sets were constructed as part of a hierarchical approach
to QSAR modeling. The hierarchy begins with the simplest
indices, the topostructural. After developing our initial model
utilizing the topostructural indices, we increase the level of
complexity. To the indices included in the best topostructural
model, we add all of the topochemical indices and proceed
to model pvap using these parameters. Likewise, the indices
included in the best model from this procedure are combined
with the geometrical indices and modeling is conducted once
again. In addition to this hierarchical approach, models were
also constructed using the topochemical indices alone and
the geometrical indices alone for purposes of comparison.
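
The control flow of the hierarchy can be sketched independently of the regression details. In this illustration `select_best` is a hypothetical callable that stands in for the stepwise or all-subsets regression step; the toy selector and the index names are assumptions made only to show how retained indices pass from one tier to the next.

```python
def hierarchical_selection(topostructural, topochemical, geometrical, select_best):
    """Run the three-tier hierarchy and return the indices retained per tier.

    `select_best` takes a list of candidate indices and returns the
    subset retained in the best model for that candidate pool.
    """
    tier1 = select_best(list(topostructural))
    tier2 = select_best(tier1 + list(topochemical))
    tier3 = select_best(tier2 + list(geometrical))
    return tier1, tier2, tier3


def keep_useful(candidates):
    """Toy selector: retain only indices from a fixed, made-up set."""
    useful = {"W", "IC0", "SIC2"}
    return [name for name in candidates if name in useful]


tiers = hierarchical_selection(
    ["W", "M1"], ["IC0", "SIC2", "J^B"], ["V_W", "3DW"], keep_useful
)
```

Because each tier sees only the survivors of the previous tier plus one new class of descriptors, the candidate pool stays small at every step.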

3.  RESULTS 

Stepwise regression analyses for log10(pvap) of the training
set of chemicals are summarized in Table 3. As shown in
Table 3, the topostructural model using three parameters
resulted in an explained variance (R^2) of 48.1% and a
standard error (s) of 0.56. Addition of the topochemical
parameters to the three topostructural parameters led to a
significant increase in the effectiveness of the model. The
resulting model used 12 parameters, two topostructural and
ten topochemical. This model had an R^2 of 80.4% and s of
0.35. All-subsets regression of the two topostructural and
ten topochemical indices retained thus far and the three
geometrical indices resulted in the selection of the same 12
parameter model; thus the geometrical indices did not
contribute significantly to model development. Several other
models were constructed for comparative purposes. Using
topochemical indices only, a ten parameter model was
developed which had an R^2 of 79.2% and s of 0.36. A
geometrical model was developed which utilized all three
geometrical indices and resulted in an R^2 of 51.8% and s of
0.53. Finally, two additional stepwise models were developed.
One model simply used all indices for a comparison
between a simple stepwise analysis of the data and the results
of the hierarchical procedure. This resulted in an 11
parameter model with R^2 of 79.6% and s of 0.35. The second
model added two new parameters, HB1 and μ. We thought
that it might be possible to improve our modeling by adding
in some other nonempirical parameters which could be
important to the determination of normal vapor pressure. We
selected the parameters HB1 and μ, since they would be
important in intermolecular interactions which could have a
dramatic effect on vapor pressure. To look at the addition
of these parameters, we conducted a stepwise regression
analysis using all topostructural, topochemical, and geometric
indices so that we would be able to optimize our model,
just as we had done with the previous models. The addition
of these parameters led to the selection of a ten parameter
model which included three topostructural indices, nine
topochemical indices, and HB1. This was the best model
yet, with an R^2 of 82.9% and s of 0.32.

Table 3. Summary of the Regression Results for the Training Set and the Prediction Results for the Test Set for the Hierarchical Analysis of
log10(pvap)

                                                  training set (N = 342)       test set (N = 134)
parameter class                  no. of variables   F       R^2 (%)   s        R^2 (%)   s
topostructural                   3                  104.6   48.1      0.56     51.9      0.46
topochemical                     10                 126.3   79.2      0.36     85.8      0.27
geometrical                      3                  168.9   51.8      0.53     62.2      0.44
topostructural + topochemical    12                 112.5   80.4      0.35     84.7      0.28
all indices                      11                 117.4   79.6      0.35     84.2      0.28
all indices + HB1 + μ            10                 160.8   82.9      0.32     83.1      0.29

Application of these six models to the test set of chemicals
resulted in comparable R² and s; actually all models improved
slightly on their predictions of the test set, and these values
are also listed in Table 3. Based on these results, we decided
that it was pointless to develop further models using only
geometrical parameters. Also, based on the findings that
the geometrical indices did not contribute significantly to
any of the training models, they were dropped from the
development of final models for the full set of 476 chemicals.
However, even though the topostructural indices did not
perform well in modeling vapor pressure by themselves, they
will be used in model development since they did contribute
significantly to most of the models.

Regression analyses of the combined set of 476 chemicals
showed similar results for estimating log10(Pvap) as analysis
of the training set. Using only the topostructural indices,
stepwise regression analysis resulted in a five parameter
model to estimate vapor pressure:

logio(Pvap)  =  0-20(0)  “  2.56(‘x)  +  0.49('‘xc)  + 

0.79(^;^c)  +  0.98(P,o)  (D 

n = 476, R² = 51.5%, s = 0.53, F = 99.7

Stepwise regression using the five topostructural parameters
and all topochemical parameters resulted in the selection
of the following seven parameter model:

log,o(Fvap)  =  8.44  -  l.nCx)  +  l-25(P,o)  "  5.69(IC,)  + 

3.91(IC2)  -  1.24(IC5)  +  1.41(yc)  “  l-^OfV)  (2) 

n = 476, R² = 79.3%, s = 0.34, F = 224.0

Only two of the topostructural indices used in eq 1 were
retained by the stepwise regression procedure used to produce
eq 2: ¹χ and P10. The improvement in R² was significant,
increasing from 51.5% for eq 1 to 79.3% for eq 2. Also,
the model error decreased significantly, dropping by 0.19
logarithmic units. Since we have dropped the geometrical
indices, this becomes our final hierarchical model.

The  stepwise  regression  analysis  of  only  topochemical 
parameters  resulted  in  a  12  parameter  model; 

logiolFyap)  ~  ■"  3.44(ICo)  -  1.33(ICn)  + 

3.47(SIC2)  +  0.87(CIC,)  -  0.48(y)  +  1.44(yc)  “ 
1.00(';t'')  -  0.41(V)  -  QloCx)  -  1.08(Vc)  + 

1.42(Vch)  -  1-230'')  (3) 

n = 476, R² = 75.8%, s = 0.38, F = 120.5

This model is inferior to the topostructural +
topochemical model (eq 2) because its variance explained
is lower and, more importantly, it requires more independent
variables (parameters) to achieve this explanation of variance.

Stepwise  regression  of  all  indices  resulted  in  the  selection 
of  an  11  parameter  model.  This  approach  selected  three 
topostructural  indices  and  eight  topochemical  indices  to  arrive 
at  the  following  model; 

log,o(Fvap)  =  7-85  -  2.56(H'')  +  l.nfV)  - 
5.01(IC,)  +  3.65(IC2)  -  0.99(IC5)  +  0.51(CIC,)  - 
1.54('/)  -  0.36(V)  -  (3.3bCx‘)  -  1.40(Vc)  (4) 

n = 476, R² = 80.4%, s = 0.33, F = 173.4

Figure 1. Scatterplot of observed log10(Pvap) versus estimated
log10(Pvap) using eq 5 for 476 diverse compounds.


While eq 4 shows some slight improvements over eq 2,
the hierarchical model, eq 2 is preferred since it is a simpler
model using seven indices instead of 11 and, based on a
comparison of F values, it is a more robust model than that
in eq 4.

Finally, we conducted the stepwise regression modeling
using all topostructural and topochemical indices with HB1
and μ for the complete set of 476 chemicals. The resulting
ten parameter model used three topostructural indices, six
topochemical indices, and HB1:

login(Aap»  =  9.67  -  3.66('x)  +  O.BSCPj)  +  0.74(P,)  - 
1.78(1C„)  -  3.33(S1C,)  -  0.81(C1C,)  +  2.05(V)  - 
1.73(V)  -  0.79CV)  -  0.29{HB|)  (5) 

n = 476, R² = 84.3%, s = 0.29, F = 249.5

Equation 5 shows marked improvement over eq 2,
justifying the addition of indices to the model. Also, it meets
the criteria on which eq 4 was judged to be lacking. Overall,
there is an improvement in variance explained of 5%, with
a comparable decrease in standard deviation. A scatter plot
of observed log10(Pvap) versus estimated log10(Pvap) using eq
5 is presented in Figure 1.

4.  DISCUSSION 

The purpose of this paper was twofold: (a) to study the
utility of algorithmically derived molecular descriptors in
developing QSAR models for predicting the vapor pressure
of chemicals from structure and (b) to investigate the relative


Table 4. Summary of the Chemical Class Composition of the
Normal Vapor Pressure Dataset

compd classification                   no. of compds   pure   substituted
total normal vapor pressure dataset    476
hydrocarbons                           253
non-hydrocarbons (a)                   223
nitro compounds                        4               3      1
amines                                 20              17     3
nitriles                               7               6      1
ketones                                7               7      0
halogens                               100             95     5
anhydrides                             1               1      0
esters                                 18              16     2
carboxylic acids                       2               2      0
alcohols                               10              6      4
sulfides                               39              38     1
thiols                                 4               4      0
imines                                 2               2      0
epoxides                               1               1      0
aromatic compounds (b)                 15              10     4
fused-ring compounds (c)               1               1      0

(a) The non-hydrocarbons are further broken down into the following
groups. (b) The 15 aromatic compounds are a mixture of 11 aromatic
hydrocarbons and four aromatic halides. (c) The only fused-ring
compound was a polycyclic aromatic hydrocarbon.


roles of topostructural, topochemical, and geometrical indices
in the estimation of standard vapor pressure.

Results described in this paper (eqs 1-5) show that
nonempirical parameters derived predominantly from graph
theoretic models of molecules can estimate normal vapor
pressure of diverse chemicals reasonably well. The explained
variance of data (R² = 84.3%) is excellent in view
of the fact that the database of chemicals analyzed in this
paper is very diverse (see Table 4). It should be mentioned
that most published QSAR models for the estimation of vapor
pressure have dealt with much smaller data sets with limited
structural variety.

The relative effectiveness of topostructural, topochemical,
and geometrical indices in predicting normal vapor pressure
of chemicals is evident from the results presented above.
Equation 1 explains over 51% of variance in the data. All
parameters used to derive eq 1 are topostructural, i.e., they
are parameters which encode information about the adjacency
and distance of vertices in skeletal molecular graphs without
quantifying any explicit information about such chemical
aspects as bond order, electronic character of atoms, etc.
Yet, the high explained variance of the property indicates
that adjacency and distance in chemical graphs, being general
descriptors of molecular size, shape, and branching, are
important in predicting properties. This may explain the
success of parameters like simple connectivity indices in
estimating many diverse properties.

Equation 3 is derived only from topochemical indices. The
explained variance of vapor pressure (75.8%) shows that
topochemical parameters, as a class, explain a larger fraction
of the variance than models derived from only
topostructural indices (eq 1). Geometrical parameters were
dropped from the set of descriptors after their limited success
in prediction for the training and test sets. This is in line
with our earlier studies with normal boiling point and
hydrophobicity, where it was reported that the addition of
geometrical indices could not significantly improve the
predictive power of QSAR models derived from a combined
set of topostructural and topochemical parameters. It would
be interesting to see whether this pattern holds for other
properties as well. Finally, the addition of the simple
nonempirical parameter, HB1, which contains information
relevant to intermolecular interactions, further improves our
ability to estimate normal vapor pressure, resulting in an
explained variance of 84.3% (eq 5).
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chemicals  using  topological  invariants 


Table 1: Relative effectiveness of statistical and neural
network methods in estimating LC50 of 69 benzene
derivatives.

Method               R²       Standard Error
Linear regression    0.825    0.32
NN with 95 inputs    0.868    0.29
NN with VARCLUS      0.878    0.28

or clusters, based on the correlation matrix using the
VARCLUS procedure. This procedure divides the set
of indices into disjoint clusters, such that each cluster
is essentially unidimensional. From each cluster we
selected the index most correlated with the cluster, as well
as any indices which were poorly correlated with their
cluster (R² < 0.70). The variable clustering and selection
of indices was performed independently for both
the topostructural and topochemical indices. This
procedure resulted in a set of five topostructural indices
and a set of nine topochemical indices. These indices
were combined with the three geometric and six quantum
chemical parameters described earlier.

The linear regression approach was that described
earlier by Gute and Basak (1997). This study found that
an accurate linear regression model for acute aquatic
toxicity required descriptors from all four levels of the
hierarchy: topostructural, topochemical, geometrical,
and quantum chemical. This model utilized seven
descriptors and obtained an explained variance (R²) of
0.863 and a standard error of 0.30. A leave-one-out
approach was then implemented to test the predictivity
of the model. This testing resulted in a model with an
R² = 0.825 and a standard error of 0.32.

We also trained neural networks using the 23 parameters
provided by this data reduction technique. The
parameter settings for these networks were the same as
the settings for the other neural network experiments
mentioned above. With these 23 parameters, the neural
networks obtained a test-set explained variance (R²) of
0.878 and a standard error of 0.28. Thus the inputs
selected by our data reduction procedure were able to
increase the accuracy of the neural network.

Discussion  and  Future  Work 

The results show that both statistical and neural network
methods give acceptable estimates of toxicity. The
neural network methods produced improvement over
the statistical model. While the method proposed here
has proven effective, there is much future work that
needs to be completed. For example, though our results
demonstrate that our method is able to accurately predict
toxicity directly from structure, it would be interesting
to know just how many compounds are needed to
learn an accurate model of toxicity. Future work, then,
is to empirically answer this question. We plan to run
our techniques on further reduced data sets and plot
leave-one-out accuracy. This would allow one to look
at a curve that plots accuracy versus training set size
and decide how many compounds need to be explicitly
tested for toxicity.

In the machine learning literature, the process of finding
and removing the variables that are unhelpful or
destructive to learning is called feature selection (Kohavi
& John 1997). Previous work on feature selection
has focused on finding the appropriate subset of relevant
features to be used in constructing one inference
model, such as our approach presented in this paper;
however, it is appropriate to start considering feature
selection with regard to ensembles. An ensemble is a
combination of the outputs from a set of models that are
generated from separately trained inductive learning
algorithms. Ensembles have been shown, in most cases,
to greatly improve generalization accuracy over a single
learning model (Breiman 1996a; Maclin & Opitz 1997;
Opitz & Shavlik 1996b; Schapire et al. 1997). Recent
research has shown that an effective ensemble should
consist of a set of models that are not only highly correct,
but ones that make their errors on different parts
of the input space as well (Hansen & Salamon 1990;
Krogh & Vedelsby 1995; Opitz & Shavlik 1996a).

Varying the feature subsets used by each member
of the ensemble helps promote the necessary diversity
and create a more effective ensemble (Opitz submitted).
Thus, this concept is particularly appropriate for
large feature sets of partially correlated inputs, such as
found in hazard assessment of compounds. Ensemble
feature selection algorithms, then, not only have the
traditional feature-selection criteria of needing to find
feature subsets that are germane to the particular task
and inductive-learning algorithm, but have the additional
criterion of finding a set of feature subsets that
will promote disagreement among the component members
of the ensemble.

The ensemble techniques we plan to test are analogous
to the popular and successful ensemble approach
Bagging (Breiman 1996b). Bagging is a statistical
"bootstrap" (Efron & Tibshirani 1993) ensemble
method that creates individuals for its ensemble by
training each predictor on a random redistribution of
the training set. Each predictor's training set is generated
by randomly drawing, with replacement, N examples,
where N is the size of the original training
set; many of the original examples may be repeated
in the resulting training set while others may be left
out. Each individual predictor in the ensemble is generated
with a different random sampling of the training
set. Breiman (1996a) showed that Bagging is effective on
"unstable" learning algorithms where small changes in
the training set result in large changes in predictions.
This shows that, on average, more diversity is created
among the predictors by varying our training set in this
manner than is lost in individual predictor accuracy by
not training each predictor on the whole data set.

Bagging is not appropriate for most toxicity domains
since they are data poor and one cannot afford to waste
training examples; however, these domains are feature
rich and thus we can attempt to create diversity by
instead varying the inputs to the learning algorithms.
Thus we plan to test the approach where each predictor's
feature set is generated by randomly drawing, with
replacement, N features, where N is the size of the
original feature set.
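
The two resampling schemes contrasted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 23-feature example simply reuses the number of inputs mentioned earlier, and collapsing repeated features into a subset is our assumption, since the text does not say how duplicates would be handled.

```python
import random

def bootstrap_examples(n_examples, rng):
    """Standard Bagging: draw N example indices, with replacement, from N examples."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]

def bootstrap_features(n_features, rng):
    """Proposed variant: draw N feature indices, with replacement, from N features.

    Repeated draws are collapsed (an assumption made here), so each ensemble
    member ends up with a random subset of the original feature set.
    """
    drawn = [rng.randrange(n_features) for _ in range(n_features)]
    return sorted(set(drawn))

def make_feature_ensemble(n_features, n_members, seed=0):
    """Return one resampled feature subset per ensemble member."""
    rng = random.Random(seed)
    return [bootstrap_features(n_features, rng) for _ in range(n_members)]

# Hypothetical usage: five ensemble members over 23 input descriptors.
subsets = make_feature_ensemble(n_features=23, n_members=5)
example_sample = bootstrap_examples(10, random.Random(1))
```

Each predictor would then be trained on all compounds but only on the columns listed in its own subset, which keeps every training example in use.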
ABSTRACT


Three similarity spaces were used in the selection of analogues and K-nearest-
neighbor (KNN)-based estimation of normal boiling points for a diverse set of 2926
chemicals. The similarity spaces consisted of principal components derived from (1)
40 topostructural indices, (2) 61 topochemical parameters, and (3) the full set of 101
topostructural and topochemical indices. The three methods selected sets of analogues
with a substantial number of structurally analogous molecules. For the KNN method
of property estimation, the similarity space that used the full set of indices was superior
to either of the subsets (topostructural or topochemical). For all three methods, K =
6-10 gave the best estimated values for boiling point.


I.  INTRODUCTION 

Interest in quantifying the similarity of molecules using computational methods has
increased. In particular, a recent trend in the characterization of similarity/
dissimilarity of chemicals makes use of graph invariants. Molecular structures can
be represented by planar graphs, G = [V, E], where the nonempty set V represents
the set of atoms and the set E generally represents covalent bonds. These graphs
can be used to adequately represent the pattern of connectedness of atoms within a
molecule. Graph invariants, values derived from planar graphs, are graph theoretic
properties which are identical for isomorphic graphs. A numerical graph invariant
or topological index maps a chemical structure into the set of real numbers.

Various graph invariants have been used in ordering and partial ordering of sets
of molecules. Various topological indices (TIs) and principal components (PCs)
derived from TIs have been used in quantifying the similarity/dissimilarity of
molecules and in the similarity-based estimation of physical and toxicological
properties. Such TIs include those derived from simple planar graphs which
contain adjacency and distance information for vertices. These TIs could be
considered topostructural indices. Other TIs, which are derived from weighted
chemical graphs, could be regarded as topochemical indices because they contain
explicit information regarding the chemical nature of the atoms (vertices) and bonds
(edges) in the molecular structure, in addition to quantifying the adjacency and
distance relationships within the graph.

Our earlier studies made use of a combination of topostructural and topochemical
indices to select analogues of chemicals and estimate properties of molecules in
large and diverse databases using the K-nearest-neighbor (KNN) method. In this
paper we have carried out a comparative analysis of similarity-based analogue
selection and KNN-based estimation of normal boiling point using: (1) a set of 40
topostructural indices, (2) a group of 61 topochemical indices, and (3) the combined
set of 101 indices.
II. METHODS

A. Database

The normal boiling point database consisted of 2926 compounds taken from the
U.S. EPA ASTER system. The data comprised a set for which chemical structures
and normal boiling point values were available, and for which it was possible to compute
all 101 TIs.

B.  Calculation  of  Indices 

The TIs calculated for this study are listed in Table 1 and include Wiener
number, molecular connectivity indices as calculated by Randić and Kier and
Hall, frequency of path lengths of varying size, information-theoretic indices
defined on distance matrices of graphs using the methods of Bonchev and
Trinajstić as well as those of Raychaudhury et al., parameters defined on the
neighborhood complexity of vertices in hydrogen-filled molecular graphs, and
Balaban's J indices. The majority of the TIs were calculated using POLLY
2.3. The J indices were calculated using software developed by the authors.

The Wiener index (W), the first topological index reported in the chemical
literature, may be calculated from the distance matrix D(G) of a hydrogen-
suppressed chemical graph G as the sum of the entries in the upper triangular
distance submatrix. The distance matrix D(G) of a nondirected graph G with n
vertices is a symmetric n × n matrix (d_ij), where d_ij is equal to the distance between
vertices v_i and v_j in G. Each diagonal element d_ii of D(G) is zero. We give below
the distance matrix D(G1) of the unlabeled hydrogen-suppressed graph G1 of
n-propanol (Figure 1):

             (1)  (2)  (3)  (4)
        (1)   0    1    2    3
        (2)   1    0    1    2
        (3)   2    1    0    1
        (4)   3    2    1    0

W is calculated as

\[ W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \]

where g_h is the number of unordered pairs of vertices whose distance is h. Thus, for
D(G1), W has a value of ten.
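
As a quick check on the worked example, the following Python sketch builds the distance matrix of the hydrogen-suppressed n-propanol graph by breadth-first search and sums its upper triangle. The adjacency-list encoding, the zero-based vertex numbering along the C-C-C-O chain, and the function names are illustrative assumptions, not part of the original software.

```python
from collections import deque

def distance_matrix(adjacency):
    """All-pairs topological distances of an unweighted graph (BFS per vertex)."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for target, d in seen.items():
            dist[source][target] = d
    return dist

def wiener_index(dist):
    """Sum of the entries in the upper triangular distance submatrix."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))

# Hydrogen-suppressed n-propanol: a four-vertex chain C-C-C-O (vertices 0..3).
n_propanol = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
D = distance_matrix(n_propanol)
W = wiener_index(D)
```

With three pairs at distance 1, two at distance 2, and one at distance 3, the sum is 3 + 4 + 3 = 10, in agreement with the value stated in the text.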

Randić's connectivity index, and higher order connectivity path, cluster, path-
cluster, and chain types of simple, bond and valence connectivity parameters were
Table 1. Symbols, Definitions, and Classifications of Topological Parameters

Topostructural

I_D^W       Information index for the magnitudes of distances between all possible pairs of
            vertices of a graph
Ī_D^W       Mean information index for the magnitude of distance
W           Wiener index = half-sum of the off-diagonal elements of the distance matrix of a
            graph
I^D         Degree complexity
H^V         Graph vertex complexity
H^D         Graph distance complexity
ĪC          Information content of the distance matrix partitioned by frequency of
            occurrences of distance h
O           Order of neighborhood when IC_r reaches its maximum value for the hydrogen-
            filled graph
M1          A Zagreb group parameter = sum of square of degree over all vertices
M2          A Zagreb group parameter = sum of cross-product of degrees over all neighboring
            (connected) vertices
^hχ         Path connectivity index of order h = 0-6
^hχ_C       Cluster connectivity index of order h = 3-6
^hχ_PC      Path-cluster connectivity index of order h = 4-6
^hχ_Ch      Chain connectivity index of order h = 3-6
P_h         Number of paths of length h = 0-10
J           Balaban's J index based on distance

Topochemical

I_ORB       Information content or complexity of the hydrogen-suppressed graph at its
            maximum neighborhood of vertices
IC_r        Mean information content or complexity of a graph based on the rth (r = 0-6)
            order neighborhood of vertices in a hydrogen-filled graph
SIC_r       Structural information content for rth (r = 0-6) order neighborhood of vertices in a
            hydrogen-filled graph
CIC_r       Complementary information content for rth (r = 0-6) order neighborhood of
            vertices in a hydrogen-filled graph
^hχ^b       Bond path connectivity index of order h = 0-6
^hχ^b_C     Bond cluster connectivity index of order h = 3-6
^hχ^b_Ch    Bond chain connectivity index of order h = 3-6
^hχ^b_PC    Bond path-cluster connectivity index of order h = 4-6
^hχ^v       Valence path connectivity index of order h = 0-6
^hχ^v_C     Valence cluster connectivity index of order h = 3-6
^hχ^v_Ch    Valence chain connectivity index of order h = 3-6
^hχ^v_PC    Valence path-cluster connectivity index of order h = 4-6
J^B         Balaban's J index based on bond types
J^X         Balaban's J index based on relative electronegativities
J^Y         Balaban's J index based on relative covalent radii
Figure 1. The unlabeled hydrogen-suppressed graph (G1) of n-propanol.


calculated using the method of Kier and Hall. The generalized form of the simple
path connectivity index is as follows:

\[ {}^{h}\chi = \sum_{\text{paths}} (v_i v_j \cdots v_{h+1})^{-1/2} \qquad (2) \]

where v_i, v_j, ..., v_{h+1} are the degrees of the vertices in the path of length h. The path
length parameters (P_h), the number of paths of length h (h = 0, 1, ..., 10) in the
hydrogen-suppressed graph, are calculated using standard algorithms.
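
To make the path connectivity index and the path counts concrete, the sketch below enumerates simple paths by depth-first extension and applies the formula to the hydrogen-suppressed n-propanol chain. It is a minimal illustration, not the POLLY implementation; the function names and the adjacency-list encoding are assumptions, and only the simple (unweighted) path index is covered.

```python
import math

def simple_paths(adjacency, length):
    """Return the set of simple paths with `length` edges, each counted once."""
    paths = set()

    def extend(path):
        if len(path) == length + 1:
            # A path and its reverse are the same path; store one orientation.
            paths.add(min(tuple(path), tuple(reversed(path))))
            return
        for v in adjacency[path[-1]]:
            if v not in path:
                extend(path + [v])

    for start in adjacency:
        extend([start])
    return paths

def path_connectivity(adjacency, order):
    """Simple path connectivity index: sum over paths of (product of degrees)^(-1/2)."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    return sum(
        math.prod(degree[v] for v in path) ** -0.5
        for path in simple_paths(adjacency, order)
    )

# Hydrogen-suppressed n-propanol: four-vertex chain with degrees 1, 2, 2, 1.
n_propanol = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
chi_1 = path_connectivity(n_propanol, 1)
p_counts = [len(simple_paths(n_propanol, h)) for h in range(4)]
```

For this chain the first-order index is 1/√2 + 1/2 + 1/√2 ≈ 1.914, and the path counts P_0 to P_3 are 4, 3, 2, and 1.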

Information-theoretic TIs are calculated by the application of information theory
on chemical graphs. An appropriate set A of n elements is derived from a molecular
graph G depending on certain structural characteristics. On the basis of an
equivalence relation defined on A, the set A is partitioned into disjoint subsets A_i of order
n_i (i = 1, 2, ..., h; Σ_i n_i = n). A probability distribution is then assigned to the set of
equivalence classes:

        A_1, A_2, ..., A_h
        p_1, p_2, ..., p_h

where p_i = n_i/n is the probability that a randomly selected element of A will occur
in the ith subset.

The mean information content of an element of A is defined by Shannon's
relation:

\[ IC = -\sum_{i=1}^{h} p_i \log_2 p_i \qquad (3) \]

The logarithm is taken at base 2 for measuring the information content in bits. The
total information content of the set A is then n × IC.

To account for the chemical nature of vertices as well as their bonding pattern,
Sarkar et al. calculated the information content of chemical graphs on the basis
of an equivalence relation where two atoms of the same element are considered
equivalent if they possess an identical first-order topological neighborhood. Since
properties of atoms or reaction centers are often modulated by stereoelectronic
characteristics of distant neighbors, i.e., neighbors of neighbors, it was deemed
essential to extend this approach to account for higher order neighbors of vertices.
This can be accomplished by defining open spheres for all vertices of a chemical
graph. If r is any nonnegative real number and v is a vertex of the graph G, then the
open sphere S(v, r) is defined as the set consisting of all vertices v_i in G such that
d(v, v_i) < r. Therefore, S(v, 0) = ∅, S(v, r) = v for 0 < r < 1, and S(v, r) is the set consisting
of v and all vertices v_i of G situated at unit distance from v, if 1 < r < 2.

One can construct such open spheres for higher integral values of r. For a
particular value of r, the collection of all such open spheres S(v, r), where v runs
over the whole vertex set V, forms a neighborhood system of the vertices of G. A
suitably defined equivalence relation can then partition V into disjoint subsets
consisting of vertices that are topologically equivalent for rth-order neighborhood.
Such an approach has been developed and the information-theoretic indices
calculated based on this idea are called indices of neighborhood symmetry.

In this method, chemicals are symbolized by weighted linear graphs. Two vertices
u_0 and v_0 of a molecular graph are said to be equivalent with respect to rth-order
neighborhood if and only if corresponding to each path u_0, u_1, ..., u_r of length r,
there is a distinct path v_0, v_1, ..., v_r of the same length such that the paths have
similar edge weights, and both u_0 and v_0 are connected to the same number and type
of atoms up to the rth-order bonded neighbors. The detailed equivalence relation
has been described in earlier studies (refs 26, 33).

Once partitioning of the vertex set for a particular order of neighborhood is
completed, IC_r is calculated by Shannon's relation (Eq. 3). Basak et al. defined
another information-theoretic measure, structural information content (SIC_r),
which is calculated as

\[ SIC_r = IC_r / \log_2 n \qquad (4) \]

where IC_r is calculated from Eq. 3 and n is the total number of vertices of the
graph.

Another information-theoretic invariant, complementary information content
(CIC_r), is defined as

\[ CIC_r = \log_2 n - IC_r \qquad (5) \]

CIC_r represents the difference between the maximum possible complexity of a graph
(where each vertex belongs to a separate equivalence class) and the realized
topological information of a chemical species as defined by IC_r.
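
The three information-theoretic quantities follow directly from the sizes of the equivalence classes, as the short Python sketch below shows. The example partition is an assumption chosen for illustration: it treats the zero-order neighborhood of hydrogen-filled n-propanol (C3H8O, 12 vertices) as a partition by element alone, giving classes of size 3 (C), 8 (H), and 1 (O). The function name is hypothetical.

```python
import math

def information_indices(class_sizes):
    """Return (IC, SIC, CIC) for a partition given by its class sizes.

    IC  = -sum(p_i * log2 p_i), with p_i = n_i / n   (Shannon's relation)
    SIC = IC / log2 n
    CIC = log2 n - IC
    """
    n = sum(class_sizes)
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    sic = ic / math.log2(n)
    cic = math.log2(n) - ic
    return ic, sic, cic

# Assumed zero-order partition of hydrogen-filled n-propanol: 3 C, 8 H, 1 O.
ic0, sic0, cic0 = information_indices([3, 8, 1])
```

Under that assumption IC is about 1.189 bits per vertex, and SIC and CIC follow by dividing by, or subtracting from, log2 12. Higher-order values such as IC2 would need the finer partition produced by the neighborhood equivalence relation, which this sketch does not compute.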

In Figure 2, the calculation of IC2, SIC2, and CIC2 is demonstrated for the labeled
hydrogen-filled graph (G2) of n-propanol.

The information-theoretic index on graph distance, I_D^W, is calculated from the
distance matrix D(G) of a chemical graph G as follows:

\[ I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h \qquad (6) \]
Figure 2. Calculation of the indices IC2, SIC2, and CIC2 for the hydrogen-filled,
labeled graph (G2) of n-propanol.

The mean information index, Ī_D^W, is found by dividing the information index I_D^W by
W. The information-theoretic parameters defined on the distance matrix, H^D and
H^V, were calculated by the method of Raychaudhury et al.
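
For the distance-based information index, a small Python sketch under stated assumptions may help: it takes a distance matrix, counts the vertex pairs g_h at each distance h, and evaluates I_D^W = W log2 W − Σ g_h · h log2 h together with its mean. The function name is hypothetical, and the matrix used is the one implied by the four-vertex chain of hydrogen-suppressed n-propanol.

```python
import math
from collections import Counter

def distance_information(dist):
    """Return (I_D^W, mean I_D^W) computed from a symmetric distance matrix."""
    n = len(dist)
    upper = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    w = sum(upper)                      # Wiener index
    g = Counter(upper)                  # g[h] = number of vertex pairs at distance h
    i_dw = w * math.log2(w) - sum(g_h * h * math.log2(h) for h, g_h in g.items())
    return i_dw, i_dw / w

# Distance matrix of the C-C-C-O chain (hydrogen-suppressed n-propanol).
D1 = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
i_dw, mean_i_dw = distance_information(D1)
```

Here W = 10 with g_1 = 3, g_2 = 2, and g_3 = 1, so I_D^W = 10 log2 10 − (4 + 3 log2 3) ≈ 24.46 and the mean index is about 2.45.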

Balaban defined a series of indices based on distance sums within the distance
matrix for a chemical graph which he designated as J indices. These indices
are highly discriminating with low degeneracy. Unlike W, the J indices have a range
of values that is independent of molecular size. The general form of the J index
calculation is as follows:

\[ J = \frac{q}{\mu + 1} \sum_{i,j \,\in\, \text{edges}} (s_i s_j)^{-1/2} \qquad (7) \]

where the cyclomatic number μ (or number of rings in the graph) is μ = q − n + 1
with q edges and n vertices, and s_i is the sum of the distances of atom i to all other
atoms and s_j is the sum of the distances of atom j to all other atoms. Variants were
proposed by Balaban for incorporating information on bond type, relative
electronegativities, and relative covalent radii.
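
The distance-based J index can be illustrated with a few lines of Python. This is a sketch of the unweighted form only (the bond-type, electronegativity, and covalent-radius variants are not covered), and the function name and input encoding are assumptions made for the example.

```python
def balaban_j(dist, edges):
    """Distance-based Balaban J index.

    dist  : symmetric matrix of topological distances
    edges : list of (i, j) vertex pairs that are bonded
    """
    n = len(dist)
    q = len(edges)
    mu = q - n + 1                          # cyclomatic number (number of rings)
    s = [sum(row) for row in dist]          # distance sum of each vertex
    return q / (mu + 1) * sum((s[i] * s[j]) ** -0.5 for i, j in edges)

# Hydrogen-suppressed n-propanol: chain of four vertices, three edges, no rings.
D1 = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
J = balaban_j(D1, [(0, 1), (1, 2), (2, 3)])
```

The distance sums are 6, 4, 4, and 6, so J = 3 × (2/√24 + 1/4) ≈ 1.975 for this acyclic four-vertex chain.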

C.  Classification  of  the  Indices 

The set of 101 TIs was partitioned into two distinct subsets: topostructural indices
and topochemical indices. Topostructural indices encode information about the
adjacency and distances of atoms (vertices) in molecular structures (graphs)
irrespective of atom type or factors such as hybridization states and number of core/
valence electrons in individual atoms. Topochemical indices quantify information
regarding specific chemical properties of the atoms comprising a molecule as well
as the topology (connectivity of atoms). Topochemical indices are derived from
weighted molecular graphs where each vertex (atom) is properly weighted with
selected chemical/physical properties. These subsets are shown in Table 1.

D.  Statistical  Methods  and  Computation  of  Similarity 
Data  Reduction 

Initially,  all  TIs  were  transformed  by  the  natural  logarithm  of  the  index  plus  one. 
This  was  done  since  the  scale  of  some  TIs  may  be  several  orders  of  magnitude 
greater  than  other  TIs. 

A principal component analysis (PCA) was used on the transformed indices to
minimize intercorrelation of indices. The PCA was accomplished using the SAS
procedure PRINCOMP. The PCA produces linear combinations of the TIs, called
principal components (PCs), which are derived from the correlation matrix. The
first PC has the largest variance, or eigenvalue, of the linear combinations of
TIs. Each subsequent PC explains the maximal index variance orthogonal to the
previous PCs, eliminating any redundancies that could occur within the set of
TIs. The maximum number of PCs generated is equal to the number of TIs
available. For the purposes of this study, only PCs with eigenvalues greater than
one were retained. A more detailed explanation of this approach has been provided
in a previous study by Basak et al. These PCs were subsequently used in
determining similarity scores as described below.

Similarity  Measures 

Intermolecular similarity was measured by the Euclidean distance (ED) within
an n-dimensional space. This n-dimensional space consisted of orthogonal
variables (PCs) derived from the TIs as described above. ED between molecules i
and j is defined as

$$ED_{ij} = \left[ \sum_{k=1}^{n} (D_{ik} - D_{jk})^2 \right]^{1/2} \qquad (8)$$

where n is the number of dimensions or PCs retained from the PCA, and $D_{ik}$
and $D_{jk}$ are the data values of the kth dimension for chemicals i and j,
respectively.

K-Nearest-Neighbor  Selection  and  Property  Estimation 

Following the quantification of intermolecular similarity of the 2926 chemicals,
the K-nearest neighbors (K = 1-10, 15, 20, 25) were determined on the basis of ED.
This procedure can be used to select structural analogues (neighbors) of a probe
compound, or the neighbors can be used in property estimation. In estimating the
normal boiling point of the probe compound, the mean observed normal boiling
point of the K-nearest neighbors was used as the estimate, and the standard error
(s) of the estimate was used to assess the efficacy of the set of indices.
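The neighbor selection and estimation step can be sketched as follows. This is a minimal sketch under stated assumptions: each chemical is represented by a tuple of PC scores, the probe itself is assumed to be absent from the library being searched, and the function names, scores, and boiling points are hypothetical values chosen only to show the arithmetic.

```python
from math import sqrt


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length tuples of PC scores."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_estimate(probe_scores, library, k):
    """Mean property value of the k library entries nearest to the probe.

    `library` is a list of (pc_scores, property_value) pairs that does not
    contain the probe compound itself.
    """
    ranked = sorted(library, key=lambda item: euclidean_distance(probe_scores, item[0]))
    neighbors = ranked[:k]
    return sum(value for _, value in neighbors) / len(neighbors)


# Hypothetical two-PC library with made-up boiling points (degrees C).
library = [
    ((0.0, 0.0), 100.0),
    ((1.0, 0.0), 110.0),
    ((0.0, 2.0), 130.0),
    ((5.0, 5.0), 250.0),
]
estimate = knn_estimate((0.1, 0.1), library, k=3)
print(estimate)
```

With k = 3 the distant fourth entry is ignored and the estimate is the mean of the three nearest values. The same ranked list also serves for analogue selection.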

III.  RESULTS 

A.  Principal  Component  Analysis 

From the PCA of the 40 topostructural indices, seven PCs with eigenvalues
greater than one were retained. These seven PCs explained, cumulatively, 90.8%
of the total variance within the TI data. Table 2 lists the eigenvalues of the seven
PCs, the proportion of variance explained by each PC, the cumulative variance
explained, and the three TIs most correlated with each individual PC.

The PCA of the 61 topochemical indices resulted in the selection of ten PCs, all
having eigenvalues greater than one. The ten PCs explain a total of 92.1% of the
variance within the TI data. Table 3 presents a summary of the information
regarding these ten PCs.
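The link between the retained eigenvalues and the cumulative variance can be checked with simple arithmetic. The sketch below assumes that, because the PCs come from the correlation matrix, each standardized index contributes unit variance, so the total variance equals the number of indices. The eigenvalues are those listed in Table 3; the variable names are illustrative.

```python
# Eigenvalues of the ten retained PCs for the 61 topochemical indices (Table 3).
eigenvalues = [20.4, 10.8, 8.1, 6.1, 3.0, 2.4, 1.7, 1.4, 1.2, 1.1]
n_indices = 61  # total variance of a correlation-matrix PCA

# Retention rule used in the text: keep PCs with eigenvalue greater than one.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Percentage of variance explained by each PC and by all retained PCs.
proportions = [100.0 * ev / n_indices for ev in retained]
cumulative_percent = sum(proportions)
print(len(retained), round(cumulative_percent, 1))
```

The cumulative figure rounds to the 92.1% quoted for the topochemical set.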




Table 2. Summary of Principal Component Analysis of 40 Topostructural Indices for 2926 Chemicals

PC   Eigenvalue   Proportion of        Cumulative           Top Three
                  Explained Variance   Explained Variance   Correlated Indices
1    28.2         46.2                 46.2                 P„Po.'^
2    11.0         18.0                 64.3                 ^ApC,  ^XpQ,  ^XpQ
3    5.9          9.6                  73.9                 ^Xo  ^Xc.  "Apc
4    4.1          6.7                  80.6                 1.  ‘Ach,  “Ac
5    2.8          4.6                  85.2                 “Ach.  *Ach,  ^Aich
6    1.9          3.1                  88.3                 ^Ach-  “Ach,  ^Ach
7    1.5          2.4                  90.8                 P10, P9

Table 3. Summary of Principal Component Analysis of 61 Topochemical Indices for 2926 Chemicals


Eigenvalue 

20.4 

10.8 

8.1 

6.1 

3.0 

2.4 
1.7 

1.4 
1.2 
1.1 


Proportion  of 
Explained 
Variance 


Cumulative 

Explained 

Variance 


Top  Three 
Correlated  Indices 

S/C4,  S/Cj,  S/C5 

ICo.  SIQ,  ICy 
*’Xt 

■•a^,  ^A',  ‘■aj 

^X'c.  ’’Xc,  ■‘A^ 

^A^,  “A'r,  '‘Xir 


Table 4. Summary of Principal Component Analysis of 101 Topological Indices for 2926 Chemicals


Eigenvalue

42.6
13.3
11.4
8.9
5.1
3.7
2.6
2.0
1.7
1.4
1.1
1.0


Proportion  of 
Explained 
Variance 


Cumulative 

Explained 

Variance 


Top  Three 
Correlated  Indices 


SICs,  C/Ce 

‘A^h 

I.  “Xch. 

ICo.  SICo,  SICy 
^A^,  ^Ac,  ’A^ 
"A^,  5A^,  ‘A^h 
“A^,  /Co,  S/Co 
^AJ,  "AJc.  ^Xt 
/C,./.  /Co 
P9.  Pio-  Ps 
Twelve PCs were retained from the PCA of the full set of 101 TIs. Each of these
PCs had an eigenvalue greater than one and, cumulatively, they explained 92.8%
of the variance within the full set of TIs. These PCs are summarized in Table 4.


Figure 3. The five analogues selected for the probe 3-methyl-4-chlorophenol using
three molecular similarity spaces: topostructural, topochemical, and all indices. The
numbers under the structures indicate the ranking of the analogues and the Euclidean
distance to the probe. Values recovered for the all-indices space: (1) 0.01, (2) 0.02,
(3) 0.02, (4) 0.03, (5) 0.03.
Table 5. Comparison of the Three Sets of TIs and Their Derivative PCs for
Prediction of Normal Boiling Point (°C) Using K-Nearest-Neighbors (n = 2926)

Indices                          K     r       s
Topostructural                   10    0.881   39.0
Topochemical                     6     0.883   38.6
Topostructural + topochemical    8     0.896   36.6

Figure 4. Pattern of (top) correlation (r) and (bottom) standard error (s) of the estimates
according to the K-nearest-neighbor selection for 2926 normal boiling points using
three molecular similarity spaces.
B. Analogue Selection

Figure 3 shows an example of analogue selection using PCs to derive a Euclidean
distance space. The first five analogues (neighbors) for the probe compound,
3-methyl-4-chlorophenol, are presented for each of the three similarity spaces. The
analogues selected by the topostructural model show a repetition of the same
skeletal structure, ignoring substituents, throughout the first five analogues. In the
topochemical model and the full set model some variability in the skeletal structure
arises (topochemical analogues 2 and 5, full set analogue 4). Also of interest is the
repetition of chemicals between the sets of analogues. While the ordering varies
between the methods, the topostructural and topochemical models select two
identical structures, the topostructural and the full set have three analogues in
common, and the topochemical and full set select four of the same analogues.
2-Chloro-5-methylphenol appears in all three sets, while there are only three unique
compounds (topostructural analogues 4 and 5, topochemical analogue 5).

C.  K-Nearest-Neighbor  Property  Estimation 

Figure 4 presents the correlation (r) and the standard error (s) of the prediction
of the normal boiling points for the 2926 chemicals for the three groups of indices
over the range of K values examined. Table 5 summarizes the
best normal boiling point model for each set of indices. The best boiling point
estimates for all three sets were for K in the range of 6 to 10. The full set of indices
gave the best result, although there was only a small difference between models.


IV.  DISCUSSION 

The purpose of this paper was to study the relative effectiveness of three similarity
spaces derived from graph invariants in the selection of structural analogues and in
the KNN-based estimation of properties. The similarity spaces were created using
a PCA of calculated graph invariants. Tables 2-4 summarize the results of the PCA
of the three sets of indices. The first PC is always correlated with indices that
quantify molecular size. In the case of the topostructural indices, the second PC is
most correlated with branching indices. In the case of PCs derived from either
topochemical or the full set of topostructural and topochemical parameters, the first
PC was strongly correlated with molecular size, while the second PC was highly
associated with the molecular complexity indices. These results are in line with our
earlier studies on different sets of chemicals.

All three spaces were used in the selection of five analogues of a particular
structure (Figure 3). Perusal of the three sets of structures shows that there is a
substantial degree of similarity among the three groups of five chemicals selected.
It is interesting to note that all five nearest neighbors of the probe selected by the
topostructural method had isomorphic skeletal graphs when hydrogen atoms are
suppressed. For the two similarity spaces created by topochemical indices alone
and the combined set of topostructural and topochemical indices, four of the five
selected neighbors are common (Figure 3), although the ordering of the molecules
is different. This shows that these two similarity methods are not intrinsically very
different. Our earlier results showed that similarity methods derived from
experimental physical properties, atom pairs, and TIs select very similar sets of
analogues.

In the case of KNN-based estimation of boiling points of chemicals from their
analogues, K was varied from 1 to 25. The best estimated value was obtained in the
range of K = 6-10. This is in line with our earlier studies with different
properties.

In  conclusion,  the  three  similarity  spaces  derived  in  this  paper  have  reasonable 
power  for  selecting  analogous  molecules  from  a  very  diverse  database  of  chemicals. 
The  KNN-based  estimation  shows  that  selected  analogues  can  be  used  for  the 
estimation  of  boiling  points  of  diverse  chemicals  if  more  accurate  methods  are  not 
available. 
Appendix 1.12 Prediction of the dermal penetration of polycyclic aromatic hydrocarbons (PAHs)
Attempts were made to develop hierarchical quantitative structure-activity relationship (QSAR)
models for the dermal penetration of polycyclic aromatic hydrocarbons (PAHs) using four
classes of theoretical structural parameters, viz., topostructural, topochemical, geometric, and
quantum chemical descriptors, and physicochemical properties such as molecular weight (MW)
and lipophilicity (log P, octanol/water). The results show that topological
and geometric descriptors and molecular weight are equally effective in predicting the dermal
penetration of PAHs. Quantum chemical parameters did not make any improvements in the
predictive power of the QSAR models.


Keywords: Hierarchical QSAR; topological indices; geometrical indices; quantum chemical
parameters; dermal penetration; polycyclic aromatic hydrocarbons


INTRODUCTION 

An understanding of the barrier properties of skin is important both for
hazard assessment following dermal exposure to toxicants [1] and for
the transdermal delivery of drugs [2]. Over the years, transdermal delivery
data on a large number of compounds have been accumulated. These
compounds cover a wide range of physicochemical properties and structural
types [1]. Attempts have been made to explain permeation behavior of
chemicals using specific models of the permeability barrier.




B  D.  GUTE  et  al. 


One of the contemporary interests in the field is the prediction of the skin
permeability of chemicals from their physicochemical and structural parameters.
Potts and Guy [1] and Guy [3] succeeded in predicting the permeability coefficient
of diverse chemicals using molecular weight (MW), molar volume (MV) and
octanol/water partition coefficient. These parameters quantify size and
hydrophobicity of chemicals. Molnar and King used the integrated molecular
transform, FT_m, as the structural parameter for predicting skin permeability of
diverse chemicals [4].

A recent interest in quantitative structure-activity relationship (QSAR)
studies is the prediction of toxicological and pharmacological properties of
chemicals directly from their structure [5-12]. This is particularly important
for the risk assessment of chemicals, where the majority of new chemicals
have little or no available experimental data [13].

Recently  we  have  developed  a  new  hierarchical  approach  to  QSAR 
using  parameters  which  can  be  computed  directly  from  molecular  structure 
[14-18].  Such  variables  include  topostructural,  topochemical,  geometrical 
and  quantum  chemical  parameters.  These  parameters  quantify  size,  shape,  and 
stereo-electronic  aspects  of  molecular  architecture.  In  view  of  the  fact  that 
well-known  molecular  properties  like  molecular  weight,  octanol/water  parti¬ 
tion  coefficient,  molar  volume  and  calculated  molecular  descriptors  like  inte¬ 
grated  molecular  transform  have  been  used  in  predicting  skin  permeability  of 
chemicals,  it  was  of  interest  to  investigate  our  hierarchical  approach  in 
estimating  skin  permeability.  To  this  end,  we  have  attempted  to  predict  the 
skin  permeability  of  a  set  of  sixty  polycyclic  aromatic  hydrocarbons  using  the 
hierarchical  QSAR  method. 

THEORETICAL  METHODS 
Database 

A data set of sixty polycyclic aromatic hydrocarbons (PAHs) was used for
the development of hierarchical QSAR models. The data were taken from the
work of Roy et al. [19]. Using equimolar concentrations for each compound,
dermal penetration (%DP) was determined 24 hours after dosing. Activity
was expressed as the percentage of the applied dose (40 nmoles per cm² of skin
surface) which penetrated the skin. The molecular structures of the PAHs
were coded for evaluation using the SMILES line notation for chemical
structure [20]. These data, including compound name, Chemical Abstracts
Service (CAS) registry number (when available), and measured dermal
penetration, are presented in Table I.
TABLE I  Sixty polycyclic aromatic hydrocarbons (PAHs) and their dermal penetration values
expressed as percent of biological activity

No.  Compound                                  CAS No.       Act.    Pred. Act.   Resid.
1    Coronene                                  191-07-1      0.70    7.18         -6.48
2    Dibenzo(a,l)pyrene                        191-30-0      2.00    7.08         -5.08
3    9,10-Diphenylanthracene                   1499-10-1     6.00    -0.10        6.10
4    Perylene                                  198-55-0      7.00    19.68        -12.68
5    Dibenzo(a,i)pyrene                        189-55-9      8.00    7.18         0.82
6    3-Methylcholanthrene                      56-49-5       8.00    11.89        -3.89
7    Benzylhydrilindenefluorene                1836-87-9     8.00    16.93        -8.93
8    7,10-Dimethylbenzo(a)pyrene               63104-33-6    8.30    11.57        -3.27
9    Indeno(1,2,3-c,d)pyrene                   193-39-5      9.00    11.52        -2.52
10   Dibenz(a,h)anthracene                     53-70-3       9.40    13.29        -3.89
11   Benzo(e)pyrene                            192-97-2      10.00   19.68        -9.68
12   Benzo(g,h,i)perylene                      191-24-2      10.00   13.19        -3.19
13   9-p-Tolylfluorene                         1815-43-0     10.00   14.97        -4.97
14   6-Ethylchrysene                           2732-58-3     10.00   16.51        -6.51
15   9-Cinnamylfluorene                        NA            11.00   8.08         2.92
16   6-Methylbenz(a)anthracene                 316-14-3      14.00   22.40        -8.40
17   Benzo(k)fluoranthene                      207-08-9      14.00   17.99        -3.99
18   Benzo(a)pyrene                            50-32-8       15.00   19.79        -4.79
19   1-Ethylpyrene                             17088-22-1    18.00   23.43        -5.43
20   1-Methyl-7-isopropylphenanthrene          483-65-8      20.00   21.95        -1.95
21   2-tert-Butylanthracene                    18801-00-8    20.00   23.28        -3.28
22   9-Phenylanthracene                        602-55-1      20.00   18.78        1.22
23   3-Methylbenzo(c)phenanthrene              56-49-5       20.00   11.89        8.11
24   10-Methylbenz(a)anthracene                2381-15-9     20.00   22.49        -2.49
25   5-Methylbenz(a)anthracene                 2319-96-2     20.00   22.40        -2.40
26   9,10-Dihydroanthracene                    613-31-0      20.00   37.63        -17.63
27   9-Phenylfluorene                          789-24-2      20.00   19.07        0.93
28   1,2,3,6,7,8-Hexahydropyrene               1732-13-4     20.00   22.00        -2.00
29   n-Butylpyrene                             35980-18-8    20.00   13.27        6.73
30   5,6-Dihydro-4H-dibenz(a,k,l)anthracene    7198-87-0     20.00   11.09        8.91
31   3-Ethylfluoranthene                       20496-16-6    20.00   21.42        -1.42
32   Triphenylene                              217-59-4      20.00   26.77        -6.77
33   7,8,9,10-Tetrahydroacephenanthrene        7468-93-1     20.00   22.03        -2.03
34   2,3-Benztriphenylene                      215-58-7      20.00   13.19        6.81
35   Benzo(c)phenanthrene                      195-19-7      20.00   26.89        -6.89
36   1-Methylpyrene                            2381-21-7     22.00   29.76        -7.76
37   3,9-Dimethylbenz(a)anthracene             316-51-8      24.00   18.22        5.78
38   2,3-Benzofluorene                         243-17-4      25.00   27.26        -2.26
39   1,2-Benzofluorene                         238-84-6      25.00   27.17        -2.17
40   9-Benzylfluorene                          1572-46-9     26.00   14.36        11.64
41   9-m-Tolylfluorene                         18153-42-9    29.00   14.97        14.03
42   Pyrene                                    129-00-0      30.00   34.84        -4.84
43   2-Ethylanthracene                         52251-71-5    30.00   31.11        -1.11
44   10-Methylbenzo(a)pyrene                   63104-32-5    32.00   15.58        16.42
45   1-Methylanthracene                        610-48-0      33.00   37.99        -4.99
46   2-Methylfluoranthene                      33543-31-6    33.00   27.67        5.33
47   3,6-Dimethylphenanthrene                  1576-67-6     33.00   32.78        0.22
48   Benzo(a)anthracene                        56-55-3       35.00   27.02        7.98
49   Fluorene                                  86-73-7       36.00   43.80        -7.80
50   2-Methylphenanthrene                      2531-84-2     38.00   37.96        0.04
51   9-Ethylfluorene                           2294-82-8     38.00   31.06        6.94
52   1-Methylphenanthrene                      832-69-9      40.00   37.85        2.15
53   9,10-Dihydrophenanthrene                  776-35-2      40.00   37.07        2.93
54   9-Vinylanthracene                         2444-68-0     40.00   35.37        4.63
55   Anthracene                                120-12-7      42.00   43.66        -1.66
56   Fluoranthene                              206-44-0      42.00   32.52        9.48
57   1-Methylfluorene                          1730-37-6     49.00   38.16        10.84
58   2-Methylanthracene                        613-12-7      50.00   38.11        11.89
59   4H-Cyclopenta(d,e,f)phenanthrene          203-64-5      50.00   36.23        13.77
60   Phenanthrene                              85-01-8       50.00   43.50        6.50

Computation  of  Indices 

Five  sets  of  parameters  have  been  used  to  construct  the  hierarchical  models 
presented  in  this  study.  These  sets  include  topostructural,  topochemical, 
geometric,  quantum  chemical,  and  physicochemical  descriptors.  Topo¬ 
structural  and  topochemical  indices  are  subsets  of  the  set  of  topological 
indices,  and  the  distinction  between  these  groups  will  be  discussed  later. 
Geometric  indices  include  the  three-dimensional  Wiener  number,  both 
hydrogen-filled  and  hydrogen-suppressed,  and  van  der  Waals  volume.  The 
quantum chemical parameters were calculated using four semi-empirical
Hamiltonians, and the physicochemical descriptors include calculated log P
and molecular weight. These physicochemical indices were included since
they are commonly used in modeling dermal penetration. The sets of indices
used in this study are summarized in Table II.

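TABLE II  Classification of parameters used in developing models for the dermal penetration
of polycyclic aromatic hydrocarbons (PAHs)

Topostructural             Topochemical                      Geometric       Quantum chemical
$I_D^W$                    $I_{ORB}$                         $V_W$           $E_{HOMO}$
$\bar{I}_D^W$              $IC_0$ - $IC_6$                   $^{3D}W$        $E_{HOMO1}$
$W$                        $SIC_0$ - $SIC_6$                 $^{3D}W_H$      $E_{LUMO}$
$I^D$                      $CIC_0$ - $CIC_6$                                 $E_{LUMO1}$
$H^V$                      $^0\chi^b$ - $^6\chi^b$                           $\Delta H_f$
$H^D$                      $^3\chi^b_C$ & $^5\chi^b_C$                       $\mu$
$\overline{IC}$            $^5\chi^b_{Ch}$ & $^6\chi^b_{Ch}$
$O$                        $^4\chi^b_{PC}$ - $^6\chi^b_{PC}$
$M_1$                      $^0\chi^v$ - $^6\chi^v$
$M_2$                      $^3\chi^v_C$ & $^5\chi^v_C$
$^0\chi$ - $^6\chi$        $^5\chi^v_{Ch}$ & $^6\chi^v_{Ch}$
$^3\chi_C$ - $^5\chi_C$    $^4\chi^v_{PC}$ - $^6\chi^v_{PC}$
$^5\chi_{Ch}$ & $^6\chi_{Ch}$   $J^B$
$^4\chi_{PC}$ - $^6\chi_{PC}$
$P_0$ - $P_{10}$
$J$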
Topological Indices

The topological indices used in this study, both the topostructural and the
topochemical, have been calculated using POLLY 2.3 [21] and software
developed by the authors. These indices include the Wiener index [22],
connectivity indices developed by Randic [23] and higher order connectivity
indices formulated by Kier and Hall [24], bonding connectivity indices
defined by Basak et al. [25], a set of information theoretic indices defined on
the distance matrices of simple molecular graphs [26, 27], neighborhood
complexity indices of hydrogen-filled molecular graphs [28, 29], and
Balaban's J indices [30-32]. Table III provides a list and brief definitions
of the topostructural, topochemical, and geometrical indices included in this
study.


TABLE III  Symbols, definitions and classifications of topological parameters

Topostructural

$I_D^W$            Information index for the magnitudes of distances between all possible pairs of
                   vertices of a graph
$\bar{I}_D^W$      Mean information index for the magnitude of distance
$W$                Wiener index = half-sum of the off-diagonal elements of the distance matrix of a
                   graph
$I^D$              Degree complexity
$H^V$              Graph vertex complexity
$H^D$              Graph distance complexity
$\overline{IC}$    Information content of the distance matrix partitioned by frequency of
                   occurrences of distance h
$O$                Order of neighborhood when $IC_r$ reaches its maximum value for the
                   hydrogen-filled graph
$M_1$              A Zagreb group parameter = sum of square of degree over all vertices
$M_2$              A Zagreb group parameter = sum of cross-product of degrees over all
                   neighboring (connected) vertices
$^h\chi$           Path connectivity index of order h = 0-6
$^h\chi_C$         Cluster connectivity index of order h = 3-5
$^h\chi_{PC}$      Path-cluster connectivity index of order h = 4-6
$^h\chi_{Ch}$      Chain connectivity index of order h = 5 & 6
$P_h$              Number of paths of length h = 0-10
$J$                Balaban's J index based on distance

Topochemical

$I_{ORB}$          Information content or complexity of the hydrogen-suppressed graph at its
                   maximum neighborhood of vertices
$IC_r$             Mean information content or complexity of a graph based on the rth (r = 0-6)
                   order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$            Structural information content for rth (r = 0-6) order neighborhood of vertices
                   in a hydrogen-filled graph
$CIC_r$            Complementary information content for rth (r = 0-6) order neighborhood of
                   vertices in a hydrogen-filled graph
$^h\chi^b$         Bond path connectivity index of order h = 0-6
$^h\chi^b_C$       Bond cluster connectivity index of order h = 3 & 5
$^h\chi^b_{Ch}$    Bond chain connectivity index of order h = 5 & 6
$^h\chi^b_{PC}$    Bond path-cluster connectivity index of order h = 4-6
$^h\chi^v$         Valence path connectivity index of order h = 0-6
$^h\chi^v_C$       Valence cluster connectivity index of order h = 3 & 5
$^h\chi^v_{Ch}$    Valence chain connectivity index of order h = 5 & 6
$^h\chi^v_{PC}$    Valence path-cluster connectivity index of order h = 4-6
$J^B$              Balaban's J index based on bond types

Geometric

$V_W$              van der Waals volume
$^{3D}W$           3-D Wiener number for the hydrogen-suppressed geometric distance matrix
$^{3D}W_H$         3-D Wiener number for the hydrogen-filled geometric distance matrix

The  topological  indices  were  divided  into  two  subsets:  topostructural  and 
topochemical  indices.  Topostructural  indices  (TSIs)  are  topological  indices 
which  only  encode  information  about  the  adjacency  and  distances  of  atoms 
(vertices)  in  molecular  structures  (graphs),  irrespective  of  the  chemical 
nature  of  the  atoms  involved  in  bonding  or  factors  such  as  hybridization 
states  and  the  number  of  core/valence  electrons  in  individual  atoms. 
Topochemical  indices  (TCIs)  are  parameters  that  quantify  information 
regarding  the  topology  (connectivity  of  atoms),  as  well  as  specific  chemical 
properties  of  the  atoms  comprising  a  molecule.  These  indices  are  derived 
from  weighted  molecular  graphs  where  each  vertex  (atom)  or  edge  (bond)  is 
properly weighted with selected chemical or physical property information.
The  division  of  the  topological  indices  into  the  sets  of  topostructural  and 
topochemical  indices  is  shown  in  Tables  II  and  III. 


Geometrical  Indices 

Van der Waals volume, $V_W$ [33-35], was calculated using Sybyl 6.1 from
Tripos Associates, Inc. [36]. The 3-D Wiener numbers were calculated by
Sybyl using an SPL (Sybyl Programming Language) program developed in
our laboratory [37]. Calculation of the 3-D Wiener number consists of summing
the entries in the upper triangular submatrix of the topographic Euclidean
distance matrix for a molecule. The 3-D coordinates for the atoms were
determined using CONCORD 3.0.1 [38]. Two variants of the 3-D Wiener
number were calculated: $^{3D}W_H$ and $^{3D}W$. For $^{3D}W_H$, hydrogen atoms are
included in the computations, and for $^{3D}W$, hydrogen atoms are excluded
from the computations.
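The summation over the upper triangle of the Euclidean distance matrix can be sketched directly from atomic coordinates. This is an illustration only, not the SPL program used in the study: the function name and the coordinates are hypothetical, and whether hydrogens are included depends simply on which atoms are passed in.

```python
from itertools import combinations
from math import dist


def wiener_3d(coordinates):
    """Sum of Euclidean distances over all unordered atom pairs.

    This equals the sum of the entries in the upper triangular submatrix
    of the geometric distance matrix built from `coordinates`.
    """
    return sum(dist(a, b) for a, b in combinations(coordinates, 2))


# Hypothetical coordinates (arbitrary units) for three atoms.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(wiener_3d(atoms))
```

Passing only the non-hydrogen atoms gives the hydrogen-suppressed variant, and passing all atoms gives the hydrogen-filled variant.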
Quantum Chemical Parameters

Quantum chemical parameters were calculated using four semi-empirical
Hamiltonian methods: modified neglect of diatomic overlap version 1
(MNDO), modified neglect of diatomic overlap Austin Model 1 (AM1),
modified neglect of diatomic overlap parametric method 3 (PM3), and
modified intermediate neglect of differential overlap version 3 (MINDO/3).
The following quantum chemical parameters were calculated using each of
the above methods: energy of the highest occupied molecular orbital
($E_{HOMO}$), energy of the second highest occupied molecular orbital
($E_{HOMO1}$), energy of the lowest unoccupied molecular orbital ($E_{LUMO}$),
energy of the second lowest unoccupied molecular orbital ($E_{LUMO1}$), heat
of formation ($\Delta H_f$), dipole moment ($\mu$), and HOMO/LUMO gap
($E_{HOMO} - E_{LUMO}$). These parameters were calculated using MOPAC 6.00 in
the Sybyl interface [39].


Physicochemical  Descriptors 

Molecular weight (MW) was calculated using Sybyl 6.1. Molecular weight
can be thought of as a descriptor which characterizes the general size of a
molecule, especially in the case of a specialized set such as the PAHs. Values of
log P were computed by CLOGP [40]. The calculated values of log P for the
set of sixty PAHs range from approximately 4.2 to 8.3 and are presented in
Table IV.


Data  Reduction 

Initially,  all  topological  indices  were  transformed  by  the  natural  logarithm 
of  the  index  plus  one.  This  was  done  to  scale  the  indices,  since  some  may  be 
several  orders  of  magnitude  greater  than  others,  while  other  indices  may 
equal  zero.  The  geometric  indices  were  also  transformed  by  the  natural 
logarithm  of  the  index  for  consistency. 
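The effect of the ln(index + 1) scaling can be seen with a short numerical check. The sketch assumes that P0, the number of paths of length zero, equals the number of non-hydrogen atoms in the hydrogen-suppressed graph (24 for coronene, C24H12, and 13 for fluorene, C13H10); the function name is illustrative.

```python
from math import log


def log_transform(index_value):
    """Scale a topological index as ln(index + 1); zero maps to zero."""
    return log(index_value + 1.0)


# Assumed raw P0 values: carbon counts of coronene and fluorene.
p0_coronene = log_transform(24)
p0_fluorene = log_transform(13)
print(round(p0_coronene, 4), round(p0_fluorene, 4))
```

Adding one before taking the logarithm keeps indices that equal zero defined, which is the reason given in the text for this form of the transformation.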

The  resulting  set  of  eighty-eight  topological  indices  was  then  partitioned 
into  two  distinct  sets,  the  topostructural  indices  (thirty-eight)  and  the 
topochemical  indices  (fifty).  Further  reduction  of  the  number  of  indepen¬ 
dent  variables  available  for  model  construction  was  still  necessary  to 
minimize  the  chance  of  spurious  correlations.  According  to  the  guidelines 
described  by  Topliss  and  Edwards,  for  a  set  of  sixty  observations, 
approximately  thirty-five  independent  variables  can  be  used  in  modeling 
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TABLE  IV  Calculated  values  for  molecular  weight  (MW),  lipophilicity  (log  P). 

To  Tw  ITp  'po 


I  300.360  7.044 

302.376  7.298 

330.430  8.266 

252.316  6.124 

289.357  NA 

268.359  7.067 

254.332  5.858 

280.370  7.422 

276.338  6.584 

10  278.354  6.838 

II  252.316  6.124 

12  276.338  6.584 

13  256.348  6.432 

14  256.348  6.842 

15  282.386  6.916 

16  .242.321  6.313 

17  252.316  6.124 

18  252.316  6.124 

19  230.310  6.128 

20  234.342  6.716 

21  234.342  6.466 

22  254.332  6.378 

23  242.321  7.067 

Compounds 1-23 (descriptor columns)

No.  $P_0$   $^{1}\chi^{b}$  $^{3D}W$
1    3.2189  2.1898  7.0226
2    3.2189  2.1910  7.1475
3    3.2958  2.2821  7.3402
4    3.0445  2.0310  6.6000
5    3.2189  2.1898  6.9618
6    3.0910  2.1299  6.8191
7    3.0445  2.0660  6.6916
8    3.1355  2.1339  6.8771
9    3.1355  2.1346  6.8812
10   3.1355  2.1122  6.9813
11   3.0445  2.0310  6.5945
12   3.1355  2.1135  6.8175
13   3.0445  2.0909  6.6725
14   3.0445  2.0713  6.6775
15   3.1355  2.1783  6.9571
16   2.9957  1.9965  6.5583
17   3.0445  2.0525  6.7022
18   3.0445  2.0296  6.6374
19   2.9444  1.9835  6.3486
20   2.9444  2.0023  6.4635
21   2.9444  1.9854  6.4892
22   3.0445  2.0425  6.6514
23   3.0910  2.1299  6.4636

Compounds 24-54

No.  MW       logP   $P_0$   $^{1}\chi^{b}$  $^{3D}W$
24   242.321  6.313  2.9957  1.9954  6.5952
25   242.321  6.313  2.9957  1.9965  6.5691
26   180.250  4.674  2.7081  1.8032  5.7671
27   242.321  5.783  2.9957  2.0388  6.5159
28   208.304  5.942  2.8332  2.0016  6.0322
29   258.364  7.186  3.0445  2.1124  6.6998
30   268.359  6.977  3.0910  2.1401  6.7552
31   230.310  6.128  2.9444  2.0090  6.3900
32   228.294  5.664  2.9444  1.9410  6.3516
33   208.304  5.942  2.8332  2.0012  6.0656
34   278.354  6.838  3.1355  2.1135  6.9177
35   228.294  5.664  2.9444  1.9395  6.3531
36   216.283  5.599  2.8904  1.9032  6.1824
37   256.348  6.962  3.0445  2.0496  6.7562
38   216.283  5.399  2.8904  1.9348  6.3157
39   216.283  5.399  2.8904  1.9360  6.2906
40   256.348  6.312  3.0445  2.0986  6.5775
41   256.348  6.432  3.0445  2.0909  6.6494
42   202.256  4.950  2.8332  1.8386  6.0130
43   206.288  5.668  2.8332  1.8860  6.1723
44   266.343  6.773  3.0910  2.0831  6.7519
45   192.261  5.139  2.7726  1.7986  5.9393
46   216.283  5.599  2.8904  1.9296  6.2290
47   206.288  5.788  2.8332  1.8647  6.1080
48   228.294  5.664  2.9444  1.9379  6.4313
49   166.223  4.225  2.6391  1.7249  5.5620
50   192.261  5.139  2.7726  1.7991  5.9358
51   194.277  5.273  2.7726  1.8866  5.8875
52   192.261  5.139  2.7726  1.8004  5.9104
53   180.250  4.784  2.7081  1.8103  5.7372
54   204.272  5.214  2.8332  1.8319  6.0757

TABLE IV (Continued)

No.  MW       logP   $P_0$   $^{1}\chi^{b}$  $^{3D}W$
55   178.234  4.490  2.7081  1.7267  5.7650
56   202.256  4.950  2.8332  1.8681  6.0547
57   180.250  4.874  2.7081  1.7964  5.7613
58   192.261  5.139  2.7726  1.7971  5.9752
59   190.245  4.685  2.7726  1.8210  5.8489
60   178.234  4.490  2.7081  1.7286  5.7224

while retaining a low probability of chance correlations ($P_c < 0.01$ with
$R^2 > 0.7$) [41].

To further reduce the number of indices available, the sets of
topostructural and topochemical indices were divided into subsets, or
clusters, based on the correlation matrices using the SAS procedure
VARCLUS [42]. This procedure divides the set of indices into disjoint
clusters, such that each cluster is essentially unidimensional.

From each cluster we selected the index most correlated with the cluster,
as well as any indices which were poorly correlated with their cluster
($R^2 < 0.70$). These indices were then used in the modeling of the dermal
penetration of the sixty PAHs. The variable clustering and selection of
indices were performed independently on the topostructural and
topochemical sets of indices. This procedure resulted in a set of eight
topostructural indices and nine topochemical indices.


Statistical  Analysis  and  Hierarchical  QSAR 

Regression modeling of the thirteen distinct sets of indices was accomplished
using the SAS procedure REG [42]. This hierarchical approach to QSAR
modeling begins with the simplest parameters, the TSIs. Increasingly
complex levels of parameters are then added. The indices from the best TSI
model are retained and the set of TCIs is added. The indices included in the
best model from this second step are then combined with the geometric
indices and regression modeling is conducted again. The quantum chemical
parameters from the various Hamiltonians are treated as unique sets of
descriptors and are individually modeled with the other parameters; e.g., the
AM1 and PM3 indices are never used in the same model. The
physicochemical descriptors were included in each step of the modeling
process to determine how they compare with the theoretical descriptors.

In addition to the seven models developed using the hierarchical
approach, seven other models were generated. These models used the
individual sets of descriptors only, to determine the potential contribution of
each set. Thus these models were generated using TCIs only,
geometric indices only, quantum chemical indices only, or physicochemical
indices only.


RESULTS 

The  variable  clustering  of  the  TSIs  resulted  in  the  selection  of  eight  indices: 
IC,  O,  ^Xch,  ^3-  LogP  and  MW  were  added  to  the  set  of 

independent  variables,  for  this  model  and  all  subsequent  models,  because 
other  studies  have  shown  the  importance  of  these  parameters  in  predicting 
dermal  penetration  [1,  19].  All-possible  subsets  regression  resulted  in  the 
selection  of  the  following  one-parameter  model  for  the  estimation  of  dermal 
penetration: 


$$\%DP = 224.1 - 67.9\,P_0 \qquad (1)$$

$n = 60$, $r^2 = 0.675$, $s = 7.4$, $F = 120.6$

In  the  next  step  of  the  hierarchy,  the  nine  TCIs  selected  by  variable 
clustering  (ICq.  SIC,.  SIC4,  CIC,.  'X^  ^XVh,  ''X'.  ^  X^.  J^)  were  combined 
with $P_0$, logP, and MW and all-subsets regression was conducted on this
set. The following model resulted:

$$\%DP = 179.7 - 78.8\,{}^{1}\chi^{b} \qquad (2)$$

$n = 60$, $r^2 = 0.695$, $s = 7.1$, $F = 132.0$

Interestingly, neither the topostructural index from the first model nor either
of our physicochemical descriptors was selected. Neither the geometrical
nor any of the quantum chemical indices added significantly to the model
produced in the second step of the hierarchy. In all cases, $^{1}\chi^{b}$ produced the
best model.

To continue our comparative study of the indices, models were
constructed using only geometric indices, only quantum chemical indices,
and only physicochemical parameters. The use of geometric parameters
alone resulted in a one-parameter model which performed as well as the TSI
model:

$$\%DP = 186.0 - 25.4\,{}^{3D}W \qquad (3)$$

$n = 60$, $r^2 = 0.673$, $s = 7.4$, $F = 119.3$

The models using only quantum chemical indices were all discarded since
none resulted in an explained variance ($r^2$) greater than 25%.

Finally, modeling was conducted using log P and MW. Molecular weight
proved to be a better descriptor for modeling the dermal penetration of
PAHs than was log P. This step resulted in the following one-parameter
model:

$$\%DP = 90.6 - 0.3\,MW \qquad (4)$$

$n = 60$, $r^2 = 0.674$, $s = 7.4$, $F = 120.0$

The values for the parameters used in the final models ($P_0$, $^{1}\chi^{b}$, $^{3D}W$, MW)
have been provided in Table IV.
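
As a rough illustration of how the four one-parameter models compare, the sketch below evaluates each equation for compound 55 using the descriptor values listed for it in Table IV. It assumes that the tabulated (log-transformed) values are the ones that enter the equations directly; the dictionary keys and the function name are hypothetical labels chosen for this example only.

```python
# Intercept and slope of each one-parameter model, copied from Eqs. (1)-(4).
MODELS = {
    "eq1_P0": (224.1, -67.9),
    "eq2_chi_b": (179.7, -78.8),
    "eq3_3DW": (186.0, -25.4),
    "eq4_MW": (90.6, -0.3),
}

# Descriptor values for compound 55 as listed in Table IV
# (assumption: these tabulated values are the model inputs).
COMPOUND_55 = {
    "eq1_P0": 2.7081,
    "eq2_chi_b": 1.7267,
    "eq3_3DW": 5.7650,
    "eq4_MW": 178.234,
}


def percent_dermal_penetration(intercept, slope, descriptor):
    """Evaluate a one-parameter linear model: %DP = intercept + slope * descriptor."""
    return intercept + slope * descriptor


predictions = {
    name: percent_dermal_penetration(*MODELS[name], COMPOUND_55[name])
    for name in MODELS
}

for name, value in predictions.items():
    print(f"{name}: {value:.1f} %DP")
```

Under that assumption, the four estimates for this compound fall within a few percentage points of one another, which is in keeping with the nearly identical $r^2$ values reported for the four models.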


DISCUSSION 

The goal of this paper was to develop models for estimating the dermal
penetration of chemicals using computed molecular descriptors. To this end
we used topostructural, topochemical, geometric, and quantum chemical
parameters which can be computed directly from the molecular structure.
We also used calculated log P (CLOGP) and molecular weight as descriptors
in the development of regression equations.

Our results show that topostructural indices ($P_0$), topochemical
parameters ($^{1}\chi^{b}$), geometrical descriptors ($^{3D}W$) and physicochemical properties
(MW) are almost equally effective in predicting the dermal penetration of
the sixty PAHs studied in this paper. Additionally, we attempted to develop
hierarchical QSAR models by adding selected topochemical, geometric, and
quantum chemical indices to the set of topostructural parameters retained
by the variable clustering method. This procedure did not result in any
improvement in the models. Interestingly, log P and the quantum chemical
descriptors gave QSAR models which were inferior to the predictive
equations generated from topostructural, topochemical or geometric
variables.

Of the four final models which were generated as part of this study, $^{1}\chi^{b}$, a
simple bond-type connectivity index which accounts for general size and
bonding patterns within the molecule, provided the best correlation with
percent dermal penetration. Figure 1 shows the correlation between
experimental dermal penetration and estimated dermal penetration using
$^{1}\chi^{b}$, and Figure 2 demonstrates the scatter of the residuals. Thus, there are
no apparent co-variance problems within this model.

FIGURE 1 Scatterplot of experimentally determined percent dermal penetration (%DP) vs.
estimated %DP using Eq. (2) for a set of 60 polycyclic aromatic hydrocarbons.
[Axis label: Percent Dermal Penetration (Experimental)]


FIGURE 2 Pattern of residual errors for the estimation of the percent dermal penetration
(%DP) of 60 polycyclic aromatic hydrocarbons using Eq. (2).


QSAR models developed in this study are in line with other published
models for dermal penetration of chemicals. Potts and Guy [1] developed
models for dermal penetration of diverse chemicals using MW, molar
volume (MV) and logP. Roy et al. developed dermal penetration models
for the same set of sixty PAHs analyzed in this paper [19] using logP and
several molecular shape descriptors in the development of regression models
($r^2$ = 54%). The parameters used by these authors quantify generalized
shape, size, and hydrophobicity of chemicals, so it is not surprising that
parameters such as $P_0$, $^{1}\chi^{b}$, $^{3D}W$ and MW are well correlated with the
dermal penetration of PAHs, since these parameters also quantify general
aspects of the size and shape of molecules.

Based on the results of this study, it seems that physical size and shape are
more important in determining the dermal penetration of PAHs than
lipophilicity. This conclusion would support the notion that larger
molecules must traverse water-filled pores rather than moving across the
dermal membrane. This would also account for the findings of Roy et al.
[19], which showed an inverse relationship between the lipophilicity of PAHs
and their dermal penetration. The more lipophilic the compound, the less
likely it is to travel through a hydrophilic channel. Additionally, it should be
noted that while these results are on par with similar studies, they also
demonstrate that there is still something missing in this characterization of
the dermal penetration of PAHs.

Abstract - Adequate experimental data necessary for hazard assessment is not available for the
majority of environmental pollutants and chemicals in commerce. This has led to the increasing
use of theoretical structural parameters in the hazard estimation of such chemicals. In this paper
we have used a hierarchical quantitative structure-activity relationship (QSAR) approach
involving topological indices, geometrical 3-dimensional (3D) indices, and quantum chemical
indices to estimate the mutagenicity of a set of 95 aromatic and heteroaromatic amines. The
results show that topological indices explain the major part of the variance in mutagenicity. The
addition of quantum chemical indices to the set of descriptors makes some improvement in the
predictive models.


The assessment of the environmental and human health hazard posed by chemicals is
frequently carried out using insufficient experimental data. This is true for industrial
chemicals as well as for substances identified in industrial effluent, hazardous waste sites
and environmental monitoring surveys (Auer et al. 1990). In 1984, the National Research
Council (NRC) studied the availability of toxicity data on industrial chemicals and found
that many of these chemicals have very little or no test data (NRC 1984). About 15 million
distinct chemical entities have been registered with the Chemical Abstract Service (CAS)
and the list is growing by nearly 750,000 per year. Out of these chemicals, about 1,000
enter into societal use every year (Arcos 1987). Very few of these chemicals have the empirical
properties needed for hazard assessment. In the United States, the Toxic Substances
Control Act (TSCA) inventory has over 72,000 entries, and the list is growing by nearly
3,000 per year (U.S. General Accounting Office [GAO] 1993). Of the some 3,000 chemicals
submitted yearly to the United States Environmental Protection Agency (USEPA) for
the premanufacture notification (PMN) process, less than 50% have any experimental
data at all, less than 15% have empirical mutagenicity data, and only about 6% have
ecotoxicological and environmental fate data. The Superfund list of hazardous substances
has only limited data for many of the more than 700 chemicals as well (Auer et al. 1990).
This pervasive lack of empirical data shows the real need for the development of methods
that can estimate environmental and toxic properties of chemicals using parameters
that can be calculated directly from molecular structure. In recent years we have been
involved in the development of such models (Basak and Magnuson 1983; Basak 1987,
1990; Basak et al. 1988, 1994; Balaban et al. 1994; Basak and Grunwald 1994a, 1994b,
1995a-1995e, 1996; Basak, Bertelsen, and Grunwald 1995; Basak, Gute, and Grunwald
1995, 1996a, 1996b; Basak, Gute, and Drewes 1996; Basak, Grunwald and Niemi 1997;
Basak and Gute 1997). Specifically, we have used graph theoretic indices, geometrical
(3-dimensional [3D]) parameters, and semiempirical quantum chemical indices in the
development of quantitative structure-activity relationship (QSAR) models pertinent to
biomedicinal chemistry and toxicology. In this chapter, we have used a hierarchical
approach in the development of QSARs for a group of 95 aromatic and heteroaromatic
amines using topological indices, 3D parameters, and a set of quantum chemical
descriptors.

The purpose in using a hierarchical approach is to begin to look at the importance of the
contribution of different classes of parameters to modeling physicochemical or biologically
relevant properties. To this end we ask these questions: What nonempirical molecular
information is adequate for the estimation of mutagenic potency? Is specific chemical
or quantum chemical information necessary, or do simple structural descriptors do an
adequate job? These questions should lead us to a deeper understanding of the principles
and molecular basis for determining mutagenic potency.


Theoretical  Methods 


Database 

A set of 95 aromatic and heteroaromatic amines previously collected from the literature
by Debnath et al. (1992) was used to study mutagenic potency. The mutagenic activities
of these compounds in S. typhimurium TA98 + S9 microsomal preparation are expressed
as the mutation rate, ln(R), the natural logarithm of revertants per nanomole. Table 17-1 lists
the compounds used in this study and their experimentally measured mutation rates.

Computation of topological indices

Topological indices (TIs) used in this study have been calculated by POLLY 2.3 (Basak et
al. 1988), which can calculate a total of 102 indices. These indices include the Wiener index
(Wiener 1947), connectivity indices (Randic 1975; Kier and Hall 1986), information
theoretic indices defined on distance matrices of graphs (Bonchev and Trinajstic 1977;
Raychaudhury et al. 1984), a set of parameters derived on the neighborhood complexity
of vertices in hydrogen-filled molecular graphs (Basak et al. 1980; Basak and Magnuson
1983; Roy et al. 1984; Basak 1987), as well as Balaban's J indices (Balaban 1982, 1983,
1986). Table 17-2 provides brief definitions for the topological indices included in this
study.

Table 17-1 Observed and estimated mutagenic potency
[ln(revertants/nmol)] for 95 aromatic and heteroaromatic amines

Nr.  Compound                                   Exp. ln(R)  Est. ln(R) (Equation 17-10)
1    2-bromo-7-aminofluorene                     2.62        1.10
2    2-methoxy-5-methylaniline (p-cresidine)    -2.05       -3.13
3    5-aminoquinoline                           -2.00       -2.30
4    4-ethoxyaniline (p-phenetidine)            -2.30       -3.76
5    1-aminonaphthalene                         -0.60       -0.32
6    4-aminofluorene                             1.13        0.44
7    2-aminoanthracene                           2.62        1.61
8    7-aminofluoranthene                         2.88        2.54
9    8-aminoquinoline                           -1.14       -1.66
10   1,7-diaminophenazine                        0.75        1.36
11   2-aminonaphthalene                         -0.67       -0.80
12   4-aminopyrene                               3.16        3.10
13   3-amino-3'-nitrobiphenyl                   -0.55       -0.19
14   2,4,5-trimethylaniline                     -1.32       -0.74
15   3-aminofluorene                             0.89        0.74
16   3,3'-dichlorobenzidine                      0.81        0.24
17   2,4-dimethylaniline (2,4-xylidine)         -2.22       -1.63
18   2,7-diaminofluorene                         0.48        0.97
19   3-aminofluoranthene                         3.31        2.57
20   2-aminofluorene                             1.93        1.08
21   2-amino-4'-nitrobiphenyl                   -0.62        0.37
22   4-aminobiphenyl                            -0.14        0.06
23   3-methoxy-4-methylaniline (o-cresidine)    -1.96       -3.27
24   2-aminocarbazole                            0.60        0.60
25   2-amino-5-nitrophenol                      -2.52       -2.01
26   2,2'-diaminobiphenyl                       -1.52       -1.24
27   2-hydroxy-7-aminofluorene                   0.41        1.61
28   1-aminophenanthrene                         2.38        1.80
29   2,5-dimethylaniline (2,5-xylidine)         -2.40       -1.55
30   4-amino-2'-nitrobiphenyl                   -0.92       -0.50
31   2-amino-4-methylphenol                     -2.10       -2.43
32   2-aminophenazine                            0.55        1.32
33   4-aminophenylsulfide                        0.31       -0.47
34   2,4-dinitroaniline                         -2.00       -0.75
35   2,4-diaminoisopropylbenzene                -3.00       -3.36
36   2,4-difluoroaniline                        -2.70       -1.29
37   4,4'-methylenedianiline                    -1.60       -0.97
38   3,3'-dimethylbenzidine                      0.01       -0.23
39   2-aminofluoranthene                         3.23        2.66
40   2-amino-3'-nitrobiphenyl                   -0.89       -0.42
41   1-aminofluoranthene                         3.35        2.23
42   4,4'-ethylenebis(aniline)                  -2.15       -0.92
43   4-chloroaniline                            -2.52       -2.94
44   2-aminophenanthrene                         2.46        1.96
45   4-fluoroaniline                            -3.32       -2.57
46   9-aminophenanthrene                         2.98        1.13
47   3,3'-diaminobiphenyl                       -1.30        ^.20
48   2-aminopyrene                               3.50        2.58
49   2,6-dichloro-1,4-phenylenediamine          -0.69       -1.46
50   2-amino-7-acetamidofluorene                 1.18        0.89
51   2,8-diaminophenazine                        1.12        1.55
52   6-aminoquinoline                           -2.67       -2.31
53   4-methoxy-2-methylaniline (m-cresidine)    -3.00       -2.44
54   3-amino-2'-nitrobiphenyl                   -1.30       -0.90
55   2,4'-diaminobiphenyl                       -0.92       -0.40
56   1,6-diaminophenazine                        0.20        0.20
57   4-aminophenyldisulfide                     -1.03       -1.00
58   2-bromo-4,6-dinitroaniline                 -0.54       -1.25
59   2,4-diamino-n-butylbenzene                 -2.70       -3.72
60   4-aminophenylether                         -1.14       -0.76
61   2-aminobiphenyl                            -1.49       -0.77
62   1,9-diaminophenazine                        0.04        0.09
63   1-aminofluorene                             0.43        0.28
64   8-aminofluoranthene                         3.80        2.69
65   2-chloroaniline                            -3.00       -2.37
66   2-amino-α,α,α-trifluorotoluene             -0.80       -1.63
67   2-amino-1-nitronaphthalene                 -1.17       -0.90
68   3-amino-4'-nitrobiphenyl                    0.69        0.14
69   4-bromoaniline                             -2.70       -3.08
70   2-amino-4-chlorophenol                     -3.00       -2.39
71   3,3'-dimethoxybenzidine                     0.15        0.05
72   4-cyclohexylaniline                        -1.24       -0.73
73   4-phenoxyaniline                            0.38       -0.50
74   4,4'-methylenebis(o-ethylaniline)          -0.99       -0.51
75   2-amino-7-nitrofluorene                     3.00        1.19
76   benzidine                                  -0.39       -0.52
77   1-amino-4-nitronaphthalene                 -1.77       -0.95
78   4-amino-3'-nitrobiphenyl                    1.02        0.47
79   4-amino-4'-nitrobiphenyl                    1.04        0.73
80   1-aminophenazine                           -0.01        1.28
81   4,4'-methylenebis(o-fluoroaniline)          0.23        0.41
82   4-chloro-2-nitroaniline                    -2.22       -2.06
83   3-aminoquinoline                           -3.14       -2.22
84   3-aminocarbazole                           -0.48        0.60
85   4-chloro-1,2-phenylenediamine              -0.49       -2.01
86   3-aminophenanthrene                         3.77        1.79
87   3,4'-diaminobiphenyl                        0.20       -0.34
88   1-aminoanthracene                           1.18        1.86
89   1-aminocarbazole                           -1.04        0.65
90   9-aminoanthracene                           0.87        1.15
91   4-aminocarbazole                           -1.42        0.38
92   6-aminochrysene                             1.83        3.41
93   1-aminopyrene                               1.43        3.51
94   4,4'-methylenebis(o-isopropylaniline)      -1.77       -1.13
95   2,7-diaminophenazine                        3.97        1.93

Computation  of  geometrical  indices 

Van der Waals volume, $V_W$ (Bondi 1964; Moriguchi et al. 1975; Moriguchi and Kanada
1977), was calculated using SYBYL 6.2 (Tripos Associates, Inc. 1994). The 3D Wiener
numbers (Bogdanov et al. 1989) were calculated by SYBYL using an SPL (SYBYL
Programming Language) program developed in our laboratory. Calculation of 3D Wiener
numbers consists of summing the entries in the upper triangular submatrix of the
topographic Euclidean distance matrix for a molecule. The 3D coordinates for the atoms were
determined using CONCORD 3.2.1 (Tripos 1993). Two variants of the 3D Wiener number
were calculated: $^{3D}W_H$ and $^{3D}W$. For $^{3D}W_H$, hydrogen atoms are included in the
computations, and for $^{3D}W$, hydrogen atoms are excluded from the computations.
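
The definition above (the sum of the upper-triangular entries of the Euclidean distance matrix) can be sketched in a few lines. This is only an illustration of that definition, not the SPL program mentioned in the text; the function name and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def wiener_3d(coords):
    """Sum of Euclidean distances over all unordered atom pairs,
    i.e. the upper triangle of the geometric distance matrix."""
    return sum(math.dist(a, b) for a, b in combinations(coords, 2))


# Hypothetical coordinates (angstroms): three collinear heavy atoms
# and a single hydrogen atom, chosen only to make the arithmetic easy.
heavy_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
hydrogen_atoms = [(0.0, 1.0, 0.0)]

w_suppressed = wiener_3d(heavy_atoms)                # hydrogens excluded
w_filled = wiener_3d(heavy_atoms + hydrogen_atoms)   # hydrogens included

print(w_suppressed, w_filled)
```

Because every added atom contributes only non-negative distances, the hydrogen-filled value is never smaller than the hydrogen-suppressed one for the same geometry.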

Computation  of  quantum  chemical  parameters 

The quantum chemical parameters $E_{HOMO}$, $E_{HOMO1}$, $E_{LUMO}$, $E_{LUMO1}$, $\Delta H_f$, and $\mu$ were
calculated for all of the following semiempirical Hamiltonians: AM1, PM3, MNDO, and
MINDO/3. These parameters were calculated by MOPAC 6.00 in the SYBYL interface
(Stewart 1990). One difficulty was encountered in using the MINDO/3 Hamiltonian.
This particular interface does not include the information necessary for handling
bromine, which is present in 3 of the 95 molecules. To avoid omitting any compounds from one of
the models, we accounted for the bromine by substituting dummy atoms which were
assigned the Gasteiger-Hückel charges calculated for the original bromine atoms. These
molecules containing the dummy atoms with assigned charges were then entered into
MOPAC for calculation.

Data  reduction 

Initially, all TIs were transformed by the natural logarithm of the index plus 1. This was
done because the scale of some indices may be several orders of magnitude greater than
that of other indices, and other indices may equal 0. The geometric indices were
transformed by the natural logarithm of the index for consistency; the addition of 1 was
unnecessary.

Table 17-2 Symbols and definitions of topological and geometrical parameters

Symbol              Definition
$I_D^W$             Information index for magnitudes of distances between all possible pairs of vertices of a graph
$\bar{I}_D^W$       Mean information index for magnitude of distance
$W$                 Wiener index = half-sum of off-diagonal elements of distance matrix of a graph
$I^D$               Degree complexity
$H^V$               Graph vertex complexity
$H^D$               Graph distance complexity
$\overline{IC}$     Information content of distance matrix partitioned by frequency of occurrences of distance h
$I_{ORB}$           Information content or complexity of hydrogen-suppressed graph at its maximum neighborhood of vertices
$O$                 Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph
$M_1$               A Zagreb group parameter = sum of square of degree over all vertices
$M_2$               A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
$IC_r$              Mean information content or complexity of a graph based on r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$             Structural information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$             Complementary information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$^{h}\chi$          Path connectivity index of order h = 0-6
$^{h}\chi_{C}$      Cluster connectivity index of order h = 3-5
$^{h}\chi_{PC}$     Path-cluster connectivity index of order h = 4-6
$^{h}\chi_{Ch}$     Chain connectivity index of order h = 5, 6
$^{h}\chi^{b}$      Bond path connectivity index of order h = 0-6
$^{h}\chi^{b}_{C}$  Bond cluster connectivity index of order h = 3, 5
$^{h}\chi^{b}_{Ch}$ Bond chain connectivity index of order h = 5, 6
$^{h}\chi^{b}_{PC}$ Bond path-cluster connectivity index of order h = 4-6
$^{h}\chi^{v}$      Valence path connectivity index of order h = 0-6
$^{h}\chi^{v}_{C}$  Valence cluster connectivity index of order h = 3, 5
$^{h}\chi^{v}_{Ch}$ Valence chain connectivity index of order h = 5, 6
$^{h}\chi^{v}_{PC}$ Valence path-cluster connectivity index of order h = 4-6
$P_h$               Number of paths of length h = 0-10
$J$                 Balaban's J index based on distance
$J^B$               Balaban's J index based on bond types
$J^X$               Balaban's J index based on relative electronegativities
$J^Y$               Balaban's J index based on relative covalent radii
$V_W$               van der Waals volume
$^{3D}W$            3D Wiener number for the hydrogen-suppressed geometric distance matrix
$^{3D}W_H$          3D Wiener number for the hydrogen-filled geometric distance matrix
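
The log scaling described in the data reduction step amounts to the two small functions below. This is a minimal sketch; the function names are hypothetical, and the example value of 14 is an assumed atom count used only to show the arithmetic.

```python
import math


def transform_topological(index_value):
    """ln(index + 1): the +1 keeps indices that equal 0 well defined."""
    return math.log(index_value + 1.0)


def transform_geometric(index_value):
    """ln(index): geometric indices are strictly positive, so no +1 is needed."""
    return math.log(index_value)


# An index equal to zero maps to zero instead of raising an error.
zero_case = transform_topological(0)

# Illustrative value: an index equal to 14 (for instance, a count of 14
# non-hydrogen atoms, which is an assumption made for this example).
example_case = transform_topological(14)

print(zero_case, round(example_case, 4))
```

The transform compresses indices that span several orders of magnitude onto a comparable scale, which is the motivation given in the text.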

The set of 91 TIs was partitioned into 2 distinct sets: topostructural indices and
topochemical indices. Topostructural indices are indices that encode information about
the adjacency and distances of atoms (vertices) in molecular structures (graphs)
irrespective of the chemical nature of the atoms involved in the bonding or factors like
hybridization states of atoms and number of core/valence electrons in individual atoms.
Topochemical indices are parameters that quantify information regarding the topology
(connectivity of atoms) as well as specific chemical properties of the atoms comprising
a molecule. Topochemical indices are derived from weighted molecular graphs where
each vertex (atom) is properly weighted with selected chemical/physical properties.
These sets of indices are shown in Table 17-3.


Table 17-3 Classification of parameters used in developing models for mutagenic potency (ln(R))

Topostructural                           Topochemical                                   Geometric    Quantum chemical (AM1, PM3, MNDO, MINDO/3)
$I_D^W$                                  $I_{ORB}$                                      $V_W$        $E_{HOMO}$
$\bar{I}_D^W$                            $IC_0$-$IC_6$                                  $^{3D}W$     $E_{HOMO1}$
$W$                                      $SIC_0$-$SIC_6$                                $^{3D}W_H$   $E_{LUMO}$
$I^D$                                    $CIC_0$-$CIC_6$                                             $E_{LUMO1}$
$H^V$                                    $^{0}\chi^{b}$-$^{6}\chi^{b}$                               $\Delta H_f$
$H^D$                                    $^{3}\chi^{b}_{C}$ and $^{5}\chi^{b}_{C}$                   $\mu$
$\overline{IC}$                          $^{5}\chi^{b}_{Ch}$ and $^{6}\chi^{b}_{Ch}$
$O$                                      $^{4}\chi^{b}_{PC}$-$^{6}\chi^{b}_{PC}$
$M_1$                                    $^{0}\chi^{v}$-$^{6}\chi^{v}$
$M_2$                                    $^{3}\chi^{v}_{C}$ and $^{5}\chi^{v}_{C}$
$^{0}\chi$-$^{6}\chi$                    $^{5}\chi^{v}_{Ch}$ and $^{6}\chi^{v}_{Ch}$
$^{3}\chi_{C}$ and $^{5}\chi_{C}$        $^{4}\chi^{v}_{PC}$-$^{6}\chi^{v}_{PC}$
$^{5}\chi_{Ch}$ and $^{6}\chi_{Ch}$      $J^B$
$^{4}\chi_{PC}$-$^{6}\chi_{PC}$          $J^X$
$P_0$-$P_{10}$                           $J^Y$
$J$

According to Topliss and Edwards (1979), in conducting QSAR studies it is important to
bear in mind that the indiscriminate use of too many independent variables can lead to
spurious (chance) correlations. Using their findings, we have determined that, for a set of
95 compounds, no more than 60 independent variables can be used in generating
regression analyses with explained variance ($R^2$) of 0.7 or greater. It must be kept in mind that
this is the total number of variables initially used in modeling, not the final number of
variables used in the model. This number of independent variables should keep the
probability of chance correlations below the 0.01 level.

To reduce the number of independent variables that we would use for model
construction, the sets of topostructural and topochemical indices were further divided into
subsets, or clusters, based on the correlation matrix using the SAS procedure VARCLUS (SAS
1988). The VARCLUS procedure divides the set of indices into disjoint clusters so that
each cluster is essentially unidimensional.

From each cluster, we selected the index most correlated with the cluster, as well as any
indices that were poorly correlated with the cluster (r < 0.70). These indices were then
used in the modeling of mutagenic potency of aromatic and heteroaromatic amines. The
variable clustering and selection of indices were performed independently for both the
topostructural and topochemical subsets.
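
The idea of grouping correlated indices and keeping one representative per group can be illustrated with a deliberately simplified stand-in. The sketch below is not the VARCLUS algorithm: it uses a greedy, single-pass grouping on squared Pearson correlation, it omits the step of also keeping poorly correlated cluster members, and the index names, data, and the 0.70 threshold on $r^2$ are assumptions made for the example.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cluster_indices(columns, threshold=0.70):
    """Greedy grouping: an index joins the first cluster whose seed index
    it correlates with (r^2 >= threshold); otherwise it seeds a new cluster."""
    clusters = []
    for name in columns:
        for cluster in clusters:
            if pearson(columns[name], columns[cluster[0]]) ** 2 >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def representative(cluster, columns):
    """Member with the highest mean r^2 against the other cluster members."""
    if len(cluster) == 1:
        return cluster[0]

    def score(name):
        others = [o for o in cluster if o != name]
        return sum(pearson(columns[name], columns[o]) ** 2 for o in others) / len(others)

    return max(cluster, key=score)


# Hypothetical index values for six molecules.
columns = {
    "size_a": [1, 2, 3, 4, 5, 6],
    "size_b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "shape": [3, 1, 4, 1, 5, 2],
}

clusters = cluster_indices(columns)
selected = [representative(c, columns) for c in clusters]
print(clusters, selected)
```

Here the two size-like indices are nearly collinear and fall into one cluster, so only one of them is carried forward, while the weakly correlated shape index forms its own cluster.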

Statistical  analysis  and  hierarchical  QSAR 

Regression modeling was accomplished using the SAS procedure REG on 13 sets of
indices. These sets were constructed as part of a hierarchical approach to QSAR model
development. The hierarchy begins with the simplest indices, the topostructural. After using
the topostructural indices to model the activity, we then proceed to add the next level of
complexity, the topochemical indices from the clustering procedure, and proceed to
model the activity using these parameters. Likewise, the indices included in the model
selected from this procedure are combined with the indices from the next level, the
geometrical indices, and modeling is conducted once again. Finally, the best model utilizing
topostructural, topochemical, and geometrical indices is combined with the quantum
chemical parameters and modeling is conducted. This final step was repeated 4 times,
each time using quantum chemical parameters from a different semiempirical
Hamiltonian, namely, AM1, PM3, MNDO, and MINDO/3. Thus quantum chemical models are
developed individually, one using the AM1 parameters, one using the MNDO parameters,
one using the PM3 parameters, and one using the MINDO/3 parameters. The regression
analysis resulted in the final selection of indices for each of the models.

Results and Discussion

The variable clustering of topostructural and topochemical indices resulted in 8
topostructural and 13 topochemical indices being retained for model construction (see Table
17-3). The results of the all-possible-subsets regression analyses are summarized
in Table 17-4. Because all sets contained well under 25 parameters, all-possible-subsets
regression was used for all analyses.
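
The hierarchical procedure combined with all-possible-subsets regression can be sketched as follows. This is an illustration under stated assumptions, not the SAS REG workflow used in the study: subsets are ranked by adjusted $R^2$ (the selection criterion is an assumption), the least-squares fit uses a small hand-written solver, and the descriptor names and data are synthetic.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting
    (assumes a is non-singular, i.e. the descriptors are not collinear)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_subset(data, names, y):
    """Try every non-empty subset of names; keep the one with the highest adjusted R^2."""
    n = len(y)
    best = (float("-inf"), ())
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            r2 = r_squared([data[s] for s in subset], y)
            adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - size - 1)
            best = max(best, (adjusted, subset))
    return best[1]


def hierarchical(data, levels, y):
    """At each level, pool the retained descriptors with the new class and re-select."""
    retained = ()
    for level in levels:
        retained = best_subset(data, list(retained) + list(level), y)
    return retained


# Synthetic descriptors: "ts*" play the role of topostructural indices,
# "tc*" the role of topochemical indices. The activity depends on ts1 and tc1.
data = {
    "ts1": [1, 2, 3, 4, 5, 6, 7, 8],
    "ts2": [1, -1, 1, -1, 1, -1, 1, -1],
    "tc1": [0, 0, 1, 1, 0, 0, 1, 1],
    "tc2": [1, 0, 0, 1, 1, 0, 0, 1],
}
activity = [2 * a + 3 * b for a, b in zip(data["ts1"], data["tc1"])]

final_model = hierarchical(data, [["ts1", "ts2"], ["tc1", "tc2"]], activity)
print(final_model)
```

The design point the sketch tries to capture is that only descriptors surviving one level are carried into the next, so each added class of parameters has to earn its place against the simpler ones already retained.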


Table 17-4 Summary of regression results for all classes of parameters

Equation  Parameter class                                     Variables included                                                                              F     $R^2$  s
17-1      topostructural                                      $O$, $^{4}\chi_{PC}$, $P_0$, $J$                                                                58.1  0.721  1.04
17-2      topochemical                                        [illegible]                                                                                     41.1  0.737  1.02
17-3      geometric                                           $^{3D}W$                                                                                        61.8  0.399  1.50
17-4      QC: AM1                                             $E_{HOMO1}$, $E_{LUMO}$, $\mu$                                                                  31.8  0.512  1.37
17-5      QC: MNDO                                            $E_{HOMO1}$, $E_{LUMO}$                                                                         54.7  0.543  1.31
17-6      QC: MINDO/3                                         $E_{HOMO}$, $E_{LUMO}$, [illegible]                                                             32.4  0.517  1.36
17-7      QC: PM3                                             $E_{HOMO}$, $E_{HOMO1}$, $E_{LUMO}$                                                             30.0  0.497  1.39
17-8      topostructural + topochemical                       $^{4}\chi_{PC}$, $P_0$, $J$, $SIC_2$, $SIC_4$, $^{5}\chi^{[?]}_{C}$                             44.5  0.752  0.99
17-9      topostructural + topochemical + geometric           $^{4}\chi_{PC}$, $J$, $SIC_2$, $SIC_4$, [illegible], $^{3D}W$                                   42.9  0.746  1.00
17-10     topostructural + topochemical + geometric + AM1     $^{4}\chi_{PC}$, $P_0$, $J$, $SIC_2$, $SIC_4$, [illegible], $E_{HOMO1}$, [illegible]            35.8  0.791  0.92
17-11     topostructural + topochemical + geometric + MNDO    $^{4}\chi_{PC}$, $P_0$, $J$, $SIC_2$, $SIC_4$, $^{5}\chi^{[?]}_{C}$, $\Delta H_f$               40.4  0.765  0.97
17-12     topostructural + topochemical + geometric + MINDO/3 $^{4}\chi_{PC}$, $P_0$, $J$, $SIC_2$, $SIC_4$, $E_{LUMO}$                                       45.8  0.758  0.98
17-13     topostructural + topochemical + geometric + PM3     $^{4}\chi_{PC}$, $P_0$, $J$, $SIC_2$, $SIC_4$, $^{5}\chi^{[?]}_{C}$, $\Delta H_f$               39.7  0.761  0.98

As can be seen from Table 17-4, using only the topostructural class of indices resulted in
a  4  parameter  model  to  estimate  ln(R)  with  a  variance  explained  {R  )  of  72.1%  and  a  stan¬ 
dard  error  (s)  of  1.04  (Equation  17-1).  The  Po  and  J  indices  are  related  to  the  size  and 
shape  of  molecular  graphs;  the  ^Xpc  encodes  information  about  the  degree  of  branching 
of  molecular  graphs;  the  0  parameter  is  related  to  the  degree  of  symmetry  of  graphs 
(Basak  et  al.  1987).  Therefore,  size,  branching,  and  symmetry  (or  complexity)  of  skeletal 
graphs  corresponding  to  molecular  structures  seem  to  be  the  predominant  factors  in 
determining  mutagenic  potency  of  the  set  of  95  aromatic  amines. 
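The F, R², and number-of-variables entries in Table 17-4 can be cross-checked against one another. The sketch below assumes that each model is an ordinary least-squares regression with an intercept, fitted to all 95 amines; under that assumption the overall F statistic follows from R² alone. The function name is illustrative and is not taken from the original analysis.

```python
def f_statistic(r_squared: float, n_obs: int, n_params: int) -> float:
    """Overall F for an OLS model with an intercept and n_params predictors.

    F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
    """
    explained = r_squared / n_params
    unexplained = (1.0 - r_squared) / (n_obs - n_params - 1)
    return explained / unexplained


# Equation 17-1: R^2 = 0.721 with four topostructural indices and 95 amines.
f_17_1 = f_statistic(0.721, n_obs=95, n_params=4)
print(round(f_17_1, 1))  # agrees with the tabulated F of 58.1 to rounding
```

The same relation can be used to infer how many variables a model contains when the variable list is hard to read, since F drops sharply as predictors are added at a fixed R².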

The second step of the hierarchical method combined the 4 topostructural parameters
from Equation 17-1 with the set of 13 topochemical parameters. The resulting model for
estimation of ln(R) included 6 parameters (Equation 17-8), which had an R² of 75.2% and
an s of 0.99. Thus we see that the addition of topochemical information does lead to an
increase in the explained variance, improving our model without greatly increasing the
number of independent variables. The independent variables of Equation 17-8 quantify
1) shape and size of molecular graphs (J, P0), 2) branching (4χPC), 3) molecular
complexity/redundancy (SIC2, SIC4), and 4) degree of cyclicity (5χC). It may be mentioned
that we have found very similar sets of topostructural and topochemical parameters useful in
estimating normal boiling point, octanol/water partition coefficient (Basak, Gute, and
Grunewald 1996b), and vapor pressure (Basak, Gute, and Grunewald 1997) of diverse
sets of molecules.

The next step of the hierarchical method takes this topostructural + topochemical model
and adds the 3 geometric indices; however, this actually led to a decrease in the explained
variance. As part of model construction, it became necessary to eliminate P0 from the set
of indices when adding the hydrogen-suppressed 3D Wiener number because of resulting
problems with variance inflation between the 2 parameters. As a result, the model
that retained the geometric parameter had a slightly lower R² and a slightly higher s than
the model using topostructural and topochemical indices only (Equation 17-9). This being
the case, we chose to use the parameters from Equation 17-8 in the following modeling with
the quantum chemical parameters. Thus, the last 4 models were all constructed with the
6 parameters from Equation 17-8 and all 6 quantum chemical parameters for the particular
Hamiltonian methodology available for modeling.
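Variance inflation between two predictors can be illustrated with a short calculation. The sketch below assumes the usual definition of the variance inflation factor, VIF = 1 / (1 - r²), for a model with two predictors whose Pearson correlation is r. The data are hypothetical and chosen only to show the effect; they are not the P0 or 3D Wiener values from this study, and the function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_pair(x, y):
    """Variance inflation factor for either member of a two-predictor model."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical descriptor values (NOT data from the study).
size_index = [1, 2, 3, 4]
distance_index = [2, 4, 6, 9]      # grows almost in step with size_index
unrelated_index = [1, -1, -1, 1]   # uncorrelated with size_index

vif_collinear = vif_pair(size_index, distance_index)
vif_independent = vif_pair(size_index, unrelated_index)
print(vif_collinear, vif_independent)
```

A VIF near 1 indicates no inflation. A common rule of thumb treats values above about 10 as problematic, although the threshold actually applied in the study is not stated in the text.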

As can be seen from Table 17-4, the AM1 parameters made the most significant
contribution to our hierarchical modeling procedure (R² = 79.1%, s = 0.92). The other 3 methods
showed only minimal improvement over the topostructural + topochemical model.

Finally, individual models using only topochemical, only geometrical, and only quantum
chemical parameters were constructed to further our understanding of the individual
contribution of these different types of parameters. The topochemical model was the
strongest of the 3, with the geometrical and quantum chemical models showing little
effectiveness. The topochemical model included 6 parameters and did show a slight
improvement in both explained variance and standard error over the topostructural model.
The goal of this chapter is to investigate the relative effectiveness of theoretical structural
parameters (namely topostructural, topochemical, geometrical, and quantum chemical
parameters) in predicting the mutagenicity of a set of aromatic and heteroaromatic
amines. To this end, we used a hierarchical approach in the development of QSARs using
4 classes of molecular descriptors.

The results show that the topostructural parameters explain a large fraction of the
variance (R²) in the mutagenic potency of the amines. The best model in this area explained
about 72% of variance in mutagenicity using O, 4χPC, P0, and J. These indices do not contain
any explicit chemical information about the molecules. The large explained variance
probably indicates that general structural features like size, shape, symmetry, and
branching play a major role in determining mutagenic potency. The addition of
topochemical variables made some improvement in the explained variance. The best
model using topostructural and topochemical indices explained about 75% of variance in
mutagenicity. The addition of geometrical parameters, however, did not make any
improvement in estimation. Finally, the addition of quantum chemical parameters was
attempted. Indices from AM1, PM3, MNDO, and MINDO/3 were used separately in
developing the QSAR models. While addition of the heat of formation, dipole moment,
and EHOMO1 parameters calculated by the AM1 method provided some improvement in
the estimation of ln(R), parameters calculated by PM3, MINDO/3, and MNDO did not
make any significant improvement in the estimation of mutagenic potency. The
calculated values for the parameters used in the hierarchical model that included the AM1
parameters (Equation 17-10) are presented in Table 17-5. These values represent the
original, nontransformed values for all indices used in Equation 17-10. Additionally,
Figure 17-1 presents a scatterplot of observed versus estimated mutagenic potency based on
Equation 17-10.

Figure 17-1  Scatterplot for observed ln(R) versus estimated ln(R)
using Equation 17-10 for set of 95 aromatic and heteroaromatic
amines (axis label: estimated ln(revertants/nmol))
Table 17-5  Calculated values for topostructural,
topochemical, and AM1 quantum chemical parameters used in Equation 17-10

Nr.  4χPC   P0  J      SIC2   SIC4   5χC        EHOMO1          ΔHf  μ
  1  2.482  15  1.722  0.780  0.966  0.080   -9.510998    57.462489  3.246
  2  1.409  10  2.356  0.824  0.875  0.059   -9.198889   -24.061979  1.613
  3  1.440  11  1.993  0.831  0.975  0.058   -9.528133    51.959364  2.993
  4  0.841  10  2.132  0.775  0.818  0.000   -9.761040   -22.045505  1.782
  5  1.440  11  1.993  0.639  0.931  0.058   -9.342732    40.325881  1.549
  6  2.209  14  1.800  0.697  0.931  0.109   -9.019172    53.561923  1.377
  7  2.148  15  1.673  0.613  0.885  0.049   -8.752501    61.467301  1.686
  8  3.051  17  1.694  0.616  0.890  0.119   -8.883560    90.631004  1.061
  9  1.440  11  1.993  0.807  0.975  0.058   -9.497513    49.496038  1.140
 10  2.650  16  1.701  0.703  0.967  0.083   -8.759018    93.256750  2.202
 11  1.292  11  1.932  0.648  0.907  0.025   -8.981140    39.152911  1.625
 12  3.058  17  1.692  0.593  0.890  0.112   -9.017251    86.180524  1.025
 13  2.289  16  1.879  0.722  0.951  0.065   -9.635184    49.692122  5.732
 14  2.154  10  2.462  0.622  0.786  0.167   -9.195396    -1.116909  1.386
 15  2.136  14  1.751  0.704  0.948  0.080   -8.880375    53.383623  1.407
 16  3.115  16  1.884  0.677  0.755  0.194   -9.010987    29.747467  1.402
 17  1.478   9  2.346  0.719  0.867  0.083   -9.402700     5.680026  1.423
 18  2.482  15  1.722  0.692  0.766  0.080   -9.008264    51.483002  0.749
 19  3.131  17  1.679  0.592  0.890  0.128   -8.745169   113.597721  1.348
 20  2.132  14  1.739  0.704  0.948  0.080   -9.316509    53.266008  1.795
 21  2.481  16  1.832  0.699  0.902  0.103  -10.009252    50.464895  5.573
 22  1.351  13  1.789  0.570  0.836  0.028   -9.611345    45.922022  1.682
 23  1.418  10  2.376  0.824  0.875  0.059   -9.233259   -23.899670  2.229
 24  2.132  14  1.739  0.715  0.981  0.057   -8.382162    66.295627  1.688
 25  2.126  11  2.396  0.874  0.942  0.121  -10.236383   -21.118276  6.030
 26  1.945  14  1.963  0.591  0.755  0.104   -8.411351    45.503434  0.270
 27  2.482  15  1.722  0.791  0.967  0.080   -9.366850     8.492721  1.867
 28  2.332  15  1.763  0.600  0.951  0.091   -8.782735    57.726120  1.543
 29  1.478   9  2.346  0.696  0.867  0.083   -9.229828     5.699677  1.431
 30  2.293  16  1.944  0.699  0.902  0.075   -9.850974    54.711440  5.793
 31  1.478   9  2.346  0.847  0.910  0.083   -9.261839   -30.703134  1.260
 32  2.148  15  1.673  0.651  0.891  0.049   -9.205497    91.251439  1.882
 33  1.221  14  1.685  0.593  0.845  0.000   -9.510446    52.769884  1.912
 34  2.499  13  2.526  0.777  0.920  0.107  -11.360524    25.435777  7.257
 35  1.838  11  2.437  0.722  0.815  0.131   -8.792416     3.913795  2.561
 36  1.478   9  2.346  0.836  0.962  0.083  -10.029053   -69.256743  2.575
 37  1.630  15  1.681  0.603  0.659  0.000   -8.406652    39.288132  1.394
 38  3.115  16  1.884  0.656  0.716  0.194   -8.782407    29.805987  2.494
 39  2.913  17  1.674  0.604  0.905  0.093   -8.844299   113.962366  0.866
 40  2.437  16  1.921  0.716  0.967  0.103   -9.940798    79.401262  6.265
 41  3.058  17  1.700  0.616  0.920  0.119   -8.657007   101.911673  1.867
 42  1.683  16  1.601  0.606  0.660  0.000   -8.707849    57.273517  2.562
 43  0.816   8  2.192  0.737  0.812  0.000   -9.948850    13.095294  2.631
 44  2.176  15  1.722  0.606  0.951  0.057   -8.807318    59.927756  1.359
Table 17-5 continued

Nr.  4χPC   P0  J      SIC2   SIC4   5χC        EHOMO1          ΔHf  μ
 45  0.816   8  2.192  0.737  0.812  0.000  -10.025071   -24.569648  2.776
 46  2.280  15  1.787  0.603  0.885  0.091   -8.826091    57.985510  1.608
 47  1.641  14  1.861  0.624  0.755  0.028   -9.637290    52.825739  0.355
 48  2.888  17  1.654  0.569  0.807  0.077   -8.537199    81.775262  1.644
 49  2.006  10  2.487  0.719  0.812  0.144   -9.653936     6.122184  0.948
 50  2.727  18  1.612  0.786  0.920  0.080   -9.409869    19.708295  4.954
 51  2.497  16  1.667  0.644  0.771  0.049   -9.614724   124.753819  2.050
 52  1.292  11  1.932  0.831  0.975  0.025   -9.345759    50.639120  2.728
 53  1.574  10  2.330  0.824  0.875  0.083   -9.524426   -23.745777  1.831
 54  2.234  16  1.984  0.716  0.967  0.075   -9.701876    55.625683  6.167
 55  1.848  14  1.867  0.628  0.902  0.066   -8.529041    45.389658  1.889
 56  2.802  16  1.739  0.677  0.755  0.117   -8.724272    87.859343  1.995
 57  1.683  16  1.601  0.584  0.643  0.000   -8.694071    52.783142  3.652
 58  3.074  14  2.661  0.813  0.920  0.174  -11.175279    33.261219  6.162
 59  1.360  12  2.246  0.740  0.890  0.059   -8.803533    -7.047410  2.543
 60  1.630  15  1.681  0.579  0.642  0.000   -8.589188    21.521611  2.589
 61  1.292  13  1.833  0.588  0.884  0.028   -9.075139    46.291223  1.526
 62  2.802  16  1.744  0.677  0.771  0.117   -8.760423    87.878976  2.958
 63  2.293  14  1.786  0.697  0.931  0.127   -8.809819    52.914796  1.658
 64  2.972  17  1.656  0.613  0.896  0.093   -8.672342    86.560420  1.569
 65  1.138   8  2.279  0.775  0.962  0.083   -9.647217    13.148070  1.773
 66  2.214  11  2.461  0.788  0.903  0.250  -10.328717  -135.798912  4.070
 67  2.274  14  2.092  0.732  0.939  0.093   -9.498965    42.132738  5.212
 68  2.332  16  1.793  0.699  0.902  0.065   -9.707684    49.439690  6.645
 69  0.816   8  2.192  0.737  0.812  0.000   -9.958995    24.673699  2.834
 70  1.478   9  2.346  0.885  0.966  0.083   -9.512320   -30.257131  1.873
 71  2.994  18  1.913  0.670  0.725  0.146   -8.597273   -29.701343  0.593
 72  1.351  13  1.789  0.633  0.783  0.048   -9.618662   -11.036978  1.453
 73  1.221  14  1.685  0.593  0.845  0.000   -9.519593    24.038959  3.243
 74  2.855  19  1.809  0.670  0.738  0.118   -8.322206    14.345758  1.347
 75  3.130  17  1.674  0.786  0.953  0.117   -9.907587    57.088597  7.715
 76  1.759  14  1.780  0.558  0.624  0.028   -8.898246    44.312986  2.417
 77  2.390  14  2.079  0.760  0.939  0.103   -9.995923    44.945430  7.318
 78  2.348  16  1.843  0.699  0.902  0.065  -10.065351    48.997787  5.907
 79  2.391  16  1.760  0.656  0.836  0.065  -10.153390    48.597189  7.636
 80  2.300  15  1.714  0.655  0.884  0.083   -9.466774    90.375028  1.894
 81  2.975  17  1.775  0.705  0.773  0.167   -8.668864   -51.583170  2.233
 82  1.851  11  2.471  0.863  0.938  0.070  -10.795945    14.958329  5.163
 83  1.292  11  1.932  0.807  0.975  0.025   -9.250508    61.289442  2.564
 84  2.136  14  1.751  0.715  0.981  0.057   -8.650669    70.561209  2.432
 85  1.478   9  2.346  0.738  0.875  0.083   -9.338439    12.337686  1.935
 86  2.180  15  1.741  0.606  0.935  0.057   -8.832492    56.103853  1.663
 87  1.700  14  1.820  0.611  0.869  0.028   -8.581538    44.585899  2.808
 88  2.300  15  1.714  0.617  0.896  0.083   -9.168383    66.520403  1.216
Table 17-5 continued

Nr.  4χPC   P0  J      SIC2   SIC4   5χC        EHOMO1          ΔHf  μ
 89  2.293  14  1.786  0.708  0.962  0.091   -8.617125    69.956608  1.276
 90  2.357  15  1.760  0.587  0.787  0.103   -9.179235    64.230081  1.689
 91  2.209  14  1.800  0.708  0.962  0.082   -8.497152    66.236222  1.211
 92  3.175  19  1.575  0.553  0.913  0.124   -8.830777   100.875189  1.130
 93  3.110  17  1.677  0.577  0.890  0.112   -8.958369    70.826740  1.287
 94  3.721  21  1.867  0.638  0.674  0.263   -8.315255    10.633206  1.225
 95  2.497  16  1.664  0.644  0.755  0.049   -9.634497   124.742897  0.004

Using the same set of aromatic amines, Debnath et al. (1992) developed various QSAR
models using hydrophobicity (log P, octanol/water), EHOMO and ELUMO calculated by the
AM1 Hamiltonian, and some indicator variables. For the largest subset (n = 88), they
derived the following model:

$$\ln(R) = 7.20 + 1.08(\log P) + 1.28(E_{\mathrm{HOMO}}) - 0.73(E_{\mathrm{LUMO}}) + 1.46(I_L) \tag{17-14}$$

s = 0.860, F = 12.6, R² = 0.806

The model in Equation 17-10 is comparable to the model developed by Debnath et al.
(1992) and uses all 95 aromatic amines, as compared to the smaller subset (n = 88) used
in their study.
The characterization of molecular structure using structural invariants has [...]
over the last ten years. Specifically, topological indices have become more widely used in the
quantification of molecular structure for use in quantitative structure-activity relationship
studies, chemical documentation, and molecular similarity studies. The basis, [...]
utility of topological indices has been examined, with an eye to the specific advantages and
problems in their use. In addition, variable clustering and principal component analysis are
examined as two potential solutions to the problem of index intercorrelation.


Keywords: Topological indices; molecular structure; graph theory; graph invariants; variable
clustering; principal component analysis


INTRODUCTION 

An important area of research in computational and mathematical
chemistry is the characterization of molecular structure using structural
invariants [1-14]. The impetus for this research trend comes from various
directions. Researchers in chemical documentation have searched for a set
of invariants which will be more convenient than the adjacency matrix (or
connection table) for the storage and comparison of chemical structures
[15]. Invariants have been used to order sets of molecules [3-5, 8, 16]. With
the substantial increase in available databases of chemical structures and
properties, attempts have been made to develop structure-activity
relationships (SARs) whereby existing molecules can be compared with other
molecules (real or hypothetical) on the basis of these structural invariants.
The properties of the molecules of interest can then be predicted based on
molecular structure without the need for experimental data.

In  this  age  of  combinatorial  chemistry  thousands  of  molecules  of  known 
structure  can  be  produced  rapidly.  However,  at  the  same  time  resources  for 
determining  even  the  simplest  properties  of  all  these  molecules  in  the 
laboratory  are  unavailable.  In  the  USA,  the  Toxic  Substances  Control  Act 
(TSCA)  Inventory  includes  nearly  74,000  chemicals  and  the  list  is  growing  at 
a  rate  of  more  than  2,000  new  submissions  to  the  United  States 
Environmental  Protection  Agency  (USEPA)  for  the  Premanufacture 
Notification  (PMN)  process  per  year  [17-20].  At  present,  risk  assessment 
of  the  PMN  chemicals  is  carried  out  using  limited  test  data.  For  example, 
approximately  15%  of  PMN  submissions  have  empirical  mutagenicity  data. 
Under  such  circumstances,  structural  descriptors  will  play  a  pivotal  role  in 
comparing  molecules  with  one  another  and  in  predicting  their  properties. 

MOLECULAR  STRUCTURE  -  BEAUTY  IN  THE  EYE 
OF  THE  BEHOLDER  OR  CONUNDRUM? 

The  main  hurdle  to  the  characterization  of  molecular  structure  is  the  lack  of 
uniformity  in  its  definition  and  quantification.  The  term  molecular  structure 
represents  a  set  of  nonequivalent  and  probably  disjoint  concepts  [21].  For 
example, the term “molecule” means different things when it represents an
assembly  of  identifiable  atoms  held  together  by  fairly  rigid  bonds  as 
compared  to  a  collection  of  delocalized  nuclei  and  electrons  in  which  all 
identical  particles  are  indistinguishable  [21].  There  is  no  reason  to  believe 
that  when  we  discuss  diverse  topics  (e.g.,  chemical  synthesis,  reaction  rates, 
spectroscopic  transitions,  reaction  mechanisms,  and  ab  initio  calculations) 
using  the  notion  of  molecular  structure,  that  the  different  meanings  we  attach 
to  this  term  originate  from  the  same  fundamental  concept  [21,  22].  This 
fundamental  problem  has  been  described  succinctly  by  Woolley  [22]: 

“... there is no reason to suppose that the same basic idea can
provide a basis for the discussion of all molecular experiments.
This is understandable if one recognizes that every physical
and chemical concept is only defined with respect to a certain
class of experiments, so that it is perfectly reasonable for
different sets of concepts, although mutually incompatible, to
be applicable to different experiments.”
In the context of molecular science, the various concepts of molecular
structure (e.g., classical valence bond representation, various chemical
graph-theoretic representations, the ball-and-stick model, representation by
minimum energy conformation, semi-symbolic contour map, or symbolic
representation by Hamiltonian operators) are distinct molecular models
derived through different means of abstraction from the same chemical
reality or molecule [23]. In each instance, the equivalence class (concept or
model of molecular structure) is generated by selecting certain aspects while
ignoring other unique properties of those actual events. This explains the
plurality of the concepts of molecular structure and their autonomous
nature, the word autonomous being used in the sense that one concept is not
logically derived from the other.


GRAPHS  AND  MOLECULAR  STRUCTURE 

At the most fundamental level, the structural model of an assembled entity
(e.g., a molecule consisting of atoms) may be defined as the pattern of
relationship among its parts as distinct from the values associated with them
[24]. Constitutional formulae of molecules are graphs where vertices
represent the set of atoms and edges represent chemical bonds [25]. The
pattern of connectedness of atoms in a molecule is preserved by
constitutional graphs. A graph (more correctly a non-directed graph) G =
[V, E] consists of a finite nonempty set V of points together with a
prescribed set E of unordered pairs of distinct points of V. A structural
model assigns to the points of G a realization in some applied field, and each
element of E indicates a pair of entities (elements of the structural model)
which are in the finite nonempty irreflexive symmetric binary relation
described by G. For example, when elements of the set V symbolize atomic
cores without valence electrons and the elements of E symbolize
two-electron bonds, G is the molecular graph or constitutional graph of a
covalent chemical species. Such a graph can represent structural formulae of
a large number of organic compounds. Since more than 90% of the
compounds described so far are either organic or contain organic ligands,
such a graph has been found to be useful in chemistry [13]. The edge need
not always represent a covalent bond. In fact, elements of E may symbolize
almost any type of bond (e.g., ionic, coordinate, hydrogen, weak bonds
representing transition states of an SN2 reaction, etc.) [27-29]. If the
interaction between a pair of atoms is asymmetric (e.g., in
polar covalent bonds, hydrogen bond donor acidity, hydrogen bond
acceptor basicity, or charge transfer complex formation) the bonding
pattern can be represented by a binary relation which is anti-reflexive and
asymmetric [6]. Further refinement could be achieved through the
assignment of weights to the vertices or edges [3], and use of multiple edges
between a pair of atoms held together both by sigma and pi bonds. The
weighted pseudograph appears to be the most general model capable of
symbolizing the bonding pattern of a large number of organic and inorganic
chemicals.

For  a  long  time,  chemists  have  relied  on  visual  perception  to  relate 
various  aspects  of  constitutional  graphs  to  observable  phenomena.  The 
power  of  graph-theoretic  formalism  in  chemistry  is  evident  from  its 
successful  applications  in  chemical  documentation,  isomer  discrimination 
and  characterization  of  molecular  branching,  enumeration  of  constitutional 
isomers  associated  with  a  particular  empirical  formula,  calculation  of 
quantum chemical parameters, structure-physicochemical property
correlations, and chemical structure-biological activity relationships [30-37].


GRAPHS  AS  MOLECULAR  MODELS 

Any concept of molecular structure is a hypothetical sketch of the
organization of atoms within the molecule. Such a model object is a general
theory and remains empirically untestable. A model object has to be grafted
to a specific theory to generate a theoretical model which can be empirically
tested [38]. For example, when it was suggested by Sylvester in 1878 [39] that
the structural formula of a molecule is a special kind of graph, it was an
innovative general theory without any predictive potential. When the idea of
combinatorics was applied on chemical graphs (model object), it could be
predicted that “there should be exactly two isomers of butane (C4H10)”
because “there are exactly two tree graphs with four vertices” when one
considers only the non-hydrogen atoms present in C4H10 [13]. This is a
theoretical model of limited predictive potential. Although it predicts the
existence of chemical species, given a set of molecules (e.g., isomers of
hexane [C6H14]) the model is incapable of predicting any properties for
these molecules. This is due to the fact that any empirical property P maps a
set of chemical structures into the set R of real numbers and thereby orders
the set empirically. Therefore, to predict the property from structure, we
need a nonempirical (structural) ordering scheme which closely resembles
the empirical ordering of structures as determined by P. This is a more
specific theoretical model based on the same model object (chemical graph)
and can be accomplished by using specific graph invariant(s).

CHARACTERIZATION  OF  MOLECULAR  GRAPHS 

Molecular graphs can be characterized by graph invariants. A graph
invariant is a graph-theoretic property which is preserved by isomorphism
[26]. A graph invariant could be a polynomial, a sequence of numbers, or a
single number. The characteristic polynomial of a graph and spectra of
graphs are graph invariants. Numerical graph invariants derived from
molecular graphs are called graph-theoretic indices or topological indices
(TIs) [?]. Topological indices quantitatively describe molecular structure and are
sensitive to such structural attributes as size, shape, patterns of branching,
[...]. [...] derived conveniently from
different matrices such as the adjacency matrix and the distance matrix. The
origins of such TIs illuminate the fundamental structural features that they
quantify. On the other hand, some indices are derived to quantify a key
structural feature which is qualitative and only understood intuitively. In
deriving his original connectivity index (1χ), Randic asked the question:
which of the two heptane isomers, viz., 3-methylhexane and 3-ethylpentane,
is more branched [9]? Until that time, branching was understood only
intuitively. Randic derived a quantitative description of branching based on
the graph-theoretic treatment of the structures. In addition, information
theoretic indices of chemical structures have been derived to answer the
question: which of a collection of structures is more complex or
heterogeneous? Different measures of molecular complexity
answer this question from different points of view [40]. In the following
section we discuss the structural basis and method of calculation for some of
the major topological indices.


CALCULATION  OF  TOPOLOGICAL  INDICES 

The Wiener index (W) [41], the first topological index reported in the
chemical literature, may be calculated from the distance matrix D(G) of a
hydrogen-suppressed chemical graph G as the sum of the entries in the upper
triangular distance submatrix. The distance matrix D(G) of a nondirected
graph G with n vertices is a symmetric n x n matrix (d_ij), where d_ij is equal to

6  S.  C.  BASAK  AND  B.  D.  CUTE 

the distance between vertices v_i and v_j in G. Each diagonal element d_ii of
D(G) is zero. We give below the distance matrix D(G1) of the unlabeled
hydrogen-suppressed graph G1 of 2,3-dimethylhexane (Fig. 1):

      (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)
  1    0    1    2    2    3    3    4    5
  2    1    0    1    1    2    2    3    4
  3    2    1    0    2    3    3    4    5
  4    2    1    2    0    1    1    2    3
  5    3    2    3    1    0    2    3    4
  6    3    2    3    1    2    0    1    2
  7    4    3    4    2    3    1    0    1
  8    5    4    5    3    4    2    1    0

W is calculated as:

$$W = \frac{1}{2}\sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \tag{1}$$

where g_h is the number of unordered pairs of vertices whose distance is h.
Thus for D(G1), W has a value of seventy.
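The calculation of W can be sketched in a few lines of code. The example below assumes the edge list of the hydrogen-suppressed graph of 2,3-dimethylhexane as read from Figure 1 and the distance matrix (vertices 3 and 5 are the methyl branches on vertices 2 and 4). The function names are illustrative, and breadth-first search is simply one convenient way to obtain the distances.

```python
from collections import Counter, deque

# Hydrogen-suppressed graph of 2,3-dimethylhexane, numbered as in Fig. 1.
# The edge list is inferred from the figure and the distance matrix.
EDGES = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8)]


def adjacency(edges):
    """Build an undirected adjacency map from a list of vertex pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def distances_from(adj, source):
    """Topological distances from source to every vertex (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def wiener_index(edges):
    """Return (W, g) where g[h] counts unordered vertex pairs at distance h."""
    adj = adjacency(edges)
    g = Counter()
    for u in adj:
        for v, d in distances_from(adj, u).items():
            if u < v:  # count each unordered pair once (upper triangle)
                g[d] += 1
    w = sum(h * count for h, count in g.items())
    return w, dict(g)


W, g_h = wiener_index(EDGES)
print(W, g_h)
```

For this graph the pair counts are g_1 = 7, g_2 = 8, g_3 = 7, g_4 = 4, and g_5 = 2, so that the sum of h · g_h reproduces W = 70. Because the graph is a tree, each g_h also equals the number of paths of length h.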

Randic's connectivity index [9], and higher-order connectivity path,
cluster, path-cluster and chain types of simple, bond and valence connectivity
parameters were calculated using the method of Kier and Hall [10]. P_h
parameters, the number of paths of length h (h = 0, 1, ..., 10) in the
hydrogen-suppressed graph, are calculated using standard algorithms.

Balaban defined a series of indices based upon distance sums within
the distance matrix for a chemical graph which he designated as J indices
[42-44]. These indices are highly discriminating with low degeneracy.
Unlike W, the range of values of the J indices is independent of molecular size.


        3   5
        |   |
    1 - 2 - 4 - 6 - 7 - 8


FIGURE 1  Hydrogen-suppressed graph of 2,3-dimethylhexane


TOPOLOGICAL  CHARACTERIZATION  OF  MOLECULES  ^ 

Information-theoretic topological indices are calculated by the
application of information theory on chemical graphs. An appropriate set A of n
elements is derived from a molecular graph G depending upon certain
structural characteristics. On the basis of an equivalence relation defined on
A, the set A is partitioned into disjoint subsets A_i of order n_i
(i = 1, 2, ..., h; Σ n_i = n). A probability distribution is then assigned to the
set of equivalence classes:

$$\begin{matrix} A_1, & A_2, & \ldots, & A_h \\ p_1, & p_2, & \ldots, & p_h \end{matrix}$$

where p_i = n_i/n is the probability that a randomly selected element of A will
occur in the ith subset.

The mean information content of an element of A is defined by Shannon's
relation [45]:

$$IC = -\sum_{i=1}^{h} p_i \log_2 p_i \tag{2}$$

The logarithm is taken at base 2 for measuring the information content in
bits. The total information content of the set A is then n x IC.
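As a worked illustration of the mean and total information content, the sketch below evaluates them for a partition given only by its class sizes. The example partition (classes of size 4, 2, and 2) corresponds to grouping the eight vertices of the hydrogen-suppressed graph of 2,3-dimethylhexane by vertex degree. That grouping is an assumption made purely for illustration and is not one of the equivalence relations discussed in the text; the function name is likewise illustrative.

```python
import math


def information_content(class_sizes):
    """Return (IC, total) in bits for a partition with the given class sizes.

    IC    = -sum(p_i * log2(p_i)), with p_i = n_i / n
    total = n * IC
    """
    n = sum(class_sizes)
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes if k > 0)
    return ic, n * ic


# Illustrative partition of 8 vertices: four of degree 1, two of degree 2,
# and two of degree 3.
ic, total = information_content([4, 2, 2])
print(ic, total)
```

With probabilities 1/2, 1/4, and 1/4 the mean information content is 1.5 bits per vertex and the total is 12 bits. A partition with a single class gives IC = 0, which reflects the absence of any structural distinction under that equivalence relation.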

It is to be noted that the information content of a graph G is not uniquely
defined. It depends on how the set A is derived from G as well as on the
equivalence relation which partitions A into disjoint subsets A_i. For
example, when A constitutes the vertex set of a chemical graph G, two
methods of partitioning have been widely used:

a) Chromatic-number coloring of G, where two vertices of the same color
are considered equivalent, and
b) Determination of the orbits of the automorphism group of G, where
vertices belonging to the same orbit are considered equivalent.


Rashevsky was the first to calculate the information content of graphs
where "topologically equivalent" vertices were placed in the same
equivalence class [46]. In Rashevsky's approach, two vertices u and v of a
graph are said to be topologically equivalent if and only if for each
neighboring vertex u_i (i = 1, 2, ..., k) of the vertex u, there is a distinct
neighboring vertex v_i of the same degree for the vertex v. While Rashevsky
used simple linear graphs with indistinguishable vertices to symbolize
molecular structure, weighted linear graphs or multigraphs are better
models for conjugated or aromatic molecules because they more properly
reflect the actual bonding patterns, i.e., electron distribution.
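Rashevsky's criterion can be illustrated with a short sketch. It assumes that u and v have the same degree, so that matching each neighbor of u to a distinct neighbor of v of equal degree amounts to comparing the sorted lists of neighbor degrees. The function name and the example graph (the hydrogen-suppressed skeleton of 2-methylbutane with arbitrary vertex labels) are illustrative choices, not taken from the original study.

```python
from collections import defaultdict


def rashevsky_classes(adjacency):
    """Group vertices whose neighbors have the same multiset of degrees."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    classes = defaultdict(list)
    for v, nbrs in adjacency.items():
        # The sorted neighbor degrees serve as a canonical key for the
        # one-to-one neighbor matching required by the definition.
        key = tuple(sorted(degree[u] for u in nbrs))
        classes[key].append(v)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical example: hydrogen-suppressed graph of 2-methylbutane.
two_methylbutane = {
    1: [2],
    2: [1, 3, 5],
    3: [2, 4],
    4: [3],
    5: [2],
}
classes = rashevsky_classes(two_methylbutane)
print(classes)
```

The two methyl groups attached to the branched carbon fall into one class, while the remaining vertices are each alone in their class.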

To account for the chemical nature of vertices as well as their bonding
pattern, Sarkar et al. [47] calculated information content of chemical graphs
on the basis of an equivalence relation where two atoms of the same element
are considered equivalent if they possess an identical first-order topological
neighborhood. Since properties of atoms or reaction centers are often
modulated by stereo-electronic characteristics of distant neighbors, i.e.,
neighbors of neighbors, it was deemed essential to extend this approach to
account for higher-order neighbors of vertices. This can be accomplished by
defining open spheres for all vertices of a chemical graph. If r is any
non-negative real number and v is a vertex of the graph G, then the open sphere
S(v, r) is defined as the set consisting of all vertices v_i in G such that
d(v, v_i) < r. Therefore, S(v, 0) = \emptyset, S(v, r) = {v} for 0 < r < 1, and
S(v, r) is the set consisting of v and all vertices v_i of G situated at unit
distance from v, if 1 < r < 2.

One can construct such open spheres for higher integral values of r. For a
particular value of r, the collection of all such open spheres S(v, r), where v
runs over the whole vertex set V, forms a neighborhood system of the
vertices of G. A suitably defined equivalence relation can then partition V
into disjoint subsets consisting of vertices which are topologically equivalent
for rth order neighborhood. Such an approach has been developed and the
information-theoretic indices calculated based on this idea are called indices
of neighborhood symmetry [40].
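The open sphere construction can be sketched with a breadth-first search. The sketch assumes that d(v, w) is the topological distance (number of edges on a shortest path); the function names and the example graph are hypothetical.

```python
from collections import deque


def distances_from(adjacency, source):
    """Breadth-first search distances (number of edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def open_sphere(adjacency, v, r):
    """Return S(v, r), the set of vertices w with d(v, w) < r."""
    return {w for w, d in distances_from(adjacency, v).items() if d < r}


# Hypothetical example: hydrogen-suppressed graph of 2-methylbutane.
graph = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
for radius in (0, 1, 2, 3):
    print(radius, sorted(open_sphere(graph, 2, radius)))
```

For vertex 2 the spheres grow from the empty set (r = 0) to the vertex itself (r = 1), then to the vertex plus its bonded neighbors (r = 2), and finally to the whole graph (r = 3).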

In this method, chemicals are symbolized by weighted linear graphs. Two
vertices u_0 and v_0 of a molecular graph are said to be equivalent with respect
to rth order neighborhood if and only if corresponding to each path u_0,
u_1, ..., u_r of length r, there is a distinct path v_0, v_1, ..., v_r of the
same length such that the paths have similar edge weights, and both u_0 and
v_0 are connected to the same number and type of atoms up to the rth order
bonded neighbors. The detailed equivalence relation has been described in
earlier studies [40, 48].

Once partitioning of the vertex set for a particular order of neighborhood
is completed, IC_r is calculated by Eq. 2. Basak et al. [49] defined another
information-theoretic measure, structural information content (SIC_r), which
is calculated as:

SIC_r = IC_r / \log_2 n    (3)

where IC_r is calculated from Eq. 2 and n is the total number of vertices of
the graph.

Another information-theoretic invariant, complementary information
content (CIC_r) [50], is defined as:

CIC_r = \log_2 n - IC_r    (4)

CIC_r represents the difference between the maximum possible complexity of a
graph (where each vertex belongs to a separate equivalence class) and the
realized topological information of a chemical species as defined by IC_r.
In Figure 2, the calculation of IC_1, SIC_1 and CIC_1 is demonstrated for the
hydrogen-filled graph of 2,3-dimethylhexane.
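Equations 2-4 can be checked numerically with a few lines of code. The sketch below takes only the sizes of the equivalence classes as input; the sizes 18, 4, 2 and 2 are those of the first order partition of the 26 vertices of hydrogen-filled 2,3-dimethylhexane shown in Figure 2. The function name is an illustrative choice.

```python
import math


def information_indices(class_sizes):
    """Return (IC, SIC, CIC) for a partition given by its class sizes."""
    n = sum(class_sizes)
    # Eq. 2: mean information content with p_i = n_i / n.
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    sic = ic / math.log2(n)   # Eq. 3
    cic = math.log2(n) - ic   # Eq. 4
    return ic, sic, cic


# First order partition of hydrogen-filled 2,3-dimethylhexane:
# 18 H atoms, 4 methyl C atoms, 2 CH C atoms, 2 CH2 C atoms.
ic1, sic1, cic1 = information_indices([18, 4, 2, 2])
print(round(ic1, 3), round(sic1, 3), round(cic1, 3))
```

Under these assumptions the partition gives IC_1 of about 1.352 bits, SIC_1 of about 0.288 and CIC_1 of about 3.348 bits. Applying the ln(x + 1) scaling described under Statistical Methods to this SIC_1 gives about 0.253, which agrees with the SIC_1 entry for 2,3-dimethylhexane in Table II.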

FIGURE 2 The calculation of IC_1, SIC_1 and CIC_1 based on the first order
neighborhoods for the labeled, hydrogen-filled graph of 2,3-dimethylhexane
(26 vertices: C_1-C_8 and H_1-H_18).

Subset   Vertices      First order neighborhood   Probability (p_i)
I        H_1-H_18      C                          18/26
II       C_1,3,5,8     H, H, H, C                 4/26
III      C_2,4         H, C, C, C                 2/26
IV       C_6,7         H, H, C, C                 2/26

IC_1  = -\sum_i p_i \log_2 p_i
      = 2 (2/26) \log_2 (26/2) + (4/26) \log_2 (26/4) + (18/26) \log_2 (26/18)
      = 1.352 bits
SIC_1 = IC_1 / \log_2 26 = 0.288
CIC_1 = \log_2 26 - IC_1 = 3.348 bits


The information-theoretic index on graph distance, I_D^W, is calculated from
the distance matrix D(G) of a chemical graph G as follows [11]:

I_D^W = W \log_2 W - \sum_h g_h h \log_2 h    (5)

The mean information index, \bar{I}_D^W, is found by dividing the information
index I_D^W by W. The information theoretic parameters defined on the
distance matrix, H^D and H^V, were calculated by the method of
Raychaudhury et al. [12].
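Equation 5 can be sketched in code as follows. The sketch assumes the usual reading of the symbols: W is the Wiener index (the half-sum of the off-diagonal elements of the distance matrix, as in Table I) and g_h is the number of unordered vertex pairs separated by distance h. The function names and the n-butane example are illustrative.

```python
import math
from collections import Counter, deque


def distance_counts(adjacency):
    """Count unordered vertex pairs at each topological distance h."""
    counts = Counter()
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for target, h in dist.items():
            if source < target:  # count each pair once
                counts[h] += 1
    return counts


def distance_information(adjacency):
    """Return (W, I_D^W, mean I_D^W) following Eq. 5."""
    g = distance_counts(adjacency)
    wiener = sum(h * g_h for h, g_h in g.items())
    idw = wiener * math.log2(wiener) - sum(
        g_h * h * math.log2(h) for h, g_h in g.items()
    )
    return wiener, idw, idw / wiener


# Hypothetical example: hydrogen-suppressed graph of n-butane.
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
wiener, idw, mean_idw = distance_information(n_butane)
print(wiener, round(idw, 3), round(mean_idw, 3))
```

For n-butane the distances 1, 2 and 3 occur 3, 2 and 1 times, so W = 10 and I_D^W = 10 log_2 10 - (4 + 3 log_2 3), about 24.464.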


THEORETICAL  METHODS 
Databases  and  Calculations 

Two data sets were used for this study: the first consists of the seventy-four
alkanes (C2-C9) and the second, more heterogeneous set was taken from the
STARLIST group of chemicals [51]. The STARLIST subset includes 219
chemicals for which HB_1 was equal to zero and calculated log P values fell in
the range of -2 to 5.5. HB_1 is a measure of the hydrogen bonding potential
of a chemical. Chemical structures for these compounds were encoded using
the SMILES line notation for chemical structures and entered into the
computer program POLLY version 2.3 for the calculation of indices [52].
Table I provides a comprehensive list and brief descriptions for these
indices.


STATISTICAL  METHODS 

Initially all TIs were transformed by the natural logarithm of the index plus
one. This is routinely done to scale the indices since there may be a
difference of several orders of magnitude between indices and some may
equal zero.

From the original sets of 102 indices calculated for both data sets, it was
necessary to remove some indices. Some of the indices for the set of alkanes
(e.g., the simple, valence and bond connectivity indices) were completely
redundant. Other indices were removed because they had values of zero for
all compounds. This "cleaning" of the sets of TIs left fifty-three indices for
the alkanes and ninety-eight indices for the STARLIST set.
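The scaling and cleaning steps can be sketched as below. This is a minimal illustration, not the procedure actually used in the study: "completely redundant" is assumed here to mean that two indices have identical values for every compound, and the index names and values in the example are made up.

```python
import math


def log_transform(table):
    """Scale every index value x to ln(x + 1)."""
    return {
        name: [math.log(x + 1.0) for x in values]
        for name, values in table.items()
    }


def clean(table):
    """Drop indices that are zero everywhere or duplicate a kept index."""
    kept = {}
    for name, values in table.items():
        if all(x == 0 for x in values):
            continue  # zero for all compounds
        if any(values == other for other in kept.values()):
            continue  # identical to an index already retained
        kept[name] = values
    return kept


# Hypothetical index table: one list of values per index, one entry
# per compound.
raw = {
    "chi1": [1.414, 1.914],
    "chi1_valence": [1.414, 1.914],  # same as chi1 for alkanes
    "chi3_chain": [0.0, 0.0],        # no rings in alkanes
    "P3": [0.0, 2.0],
}
cleaned = clean(log_transform(raw))
print(sorted(cleaned))
```

Because ln(0 + 1) = 0, indices that are zero for all compounds stay zero after scaling, so the order of the two steps does not affect which indices are removed.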

Variable clustering and principal component analysis were used on the
remaining indices to minimize problems of intercorrelation amongst the
indices. The variable clustering was conducted using the SAS procedure
VARCLUS, which divides the indices into disjoint clusters which are
essentially unidimensional based on the correlation matrix [53]. From each
cluster, the index which was most correlated with the cluster was selected as
the best representative of that cluster. In this way, individual indices are
retained while minimizing intercorrelations. This procedure resulted in the
retention of eight TIs for the alkanes (H^V, SIC_0, SIC_1, SIC_4, ^3\chi_C,
^5\chi_C and two path counts P_h) and twelve TIs for the STARLIST data
(among them IC_4, SIC_3 and CIC_1). TI values for a subset of the alkanes, the
eighteen octane isomers, are presented in Table II.


TABLE I Symbols and definitions of topological indices

I_D^W         Information index for the magnitudes of distances between all
              possible pairs of vertices of a graph
\bar{I}_D^W   Mean information index for the magnitude of distance
W             Wiener index = half-sum of the off-diagonal elements of the
              distance matrix of a graph
I^D           Degree complexity
H^V           Graph vertex complexity
H^D           Graph distance complexity
\bar{IC}      Information content of the distance matrix partitioned by
              frequency of occurrences of distance h
I_ORB         Information content or complexity of the hydrogen-suppressed
              graph at its maximum neighborhood of vertices
O             Order of neighborhood when IC_r reaches its maximum value for
              the hydrogen-filled graph
M_1           A Zagreb group parameter = sum of square of degree over all
              vertices
M_2           A Zagreb group parameter = sum of cross-product of degrees over
              all neighboring (connected) vertices
IC_r          Mean information content or complexity of a graph based on the
              rth (r = 0-6) order neighborhood of vertices in a
              hydrogen-filled graph
SIC_r         Structural information content for rth (r = 0-6) order
              neighborhood of vertices in a hydrogen-filled graph
CIC_r         Complementary information content for rth (r = 0-6) order
              neighborhood of vertices in a hydrogen-filled graph
^h\chi        Path connectivity index of order h = 0-6
^h\chi_C      Cluster connectivity index of order h = 3-6
^h\chi_Ch     Chain connectivity index of order h = 3-6
^h\chi_PC     Path-cluster connectivity index of order h = 4-6
^h\chi^b      Bond path connectivity index of order h = 0-6
^h\chi^b_C    Bond cluster connectivity index of order h = 3-6
^h\chi^b_Ch   Bond chain connectivity index of order h = 3-6
^h\chi^b_PC   Bond path-cluster connectivity index of order h = 4-6
^h\chi^v      Valence path connectivity index of order h = 0-6
^h\chi^v_C    Valence cluster connectivity index of order h = 3-6
^h\chi^v_Ch   Valence chain connectivity index of order h = 3-6
^h\chi^v_PC   Valence path-cluster connectivity index of order h = 4-6
P_h           Number of paths of length h = 0-10
J             Balaban's J index based on distance
J^B           Balaban's J index based on bond types
J^X           Balaban's J index based on relative electronegativities
J^Y           Balaban's J index based on relative covalent radii


The principal component analysis (PCA) was accomplished using the
SAS procedure PRINCOMP [54]. The PCA produces linear combinations
of the TIs, called principal components (PCs), which are derived from the
correlation matrix. The first PC has the largest variance, or eigenvalue, of
the linear combination of TIs. Each subsequent PC explains the maximal
index variance orthogonal to previous PCs, eliminating the redundancy
which can occur with TIs. The maximum number of PCs generated is equal
to the number of individual TIs available. For the purposes of this study,
only PCs with eigenvalues greater than one were retained. A more detailed
explanation of this approach has been provided in a previous study by
Basak et al. [3]. Seven PCs with eigenvalues greater than one were retained
for the alkanes and ten for the STARLIST set. Table III presents the PCs for
the octane isomers, a subset of the seventy-four alkanes.
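The retention rule (keep only PCs whose eigenvalue exceeds one) can be illustrated without SAS. The sketch below is not the PRINCOMP algorithm; it swaps in plain power iteration with deflation on a small, made-up correlation matrix, which is enough to show how the eigenvalue threshold decides the number of PCs kept. All names and numbers are illustrative.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def dominant_eigenpair(matrix, iterations=1000):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    vector = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0, vector
        vector = [x / norm for x in product]
    # Rayleigh quotient of the normalized vector.
    eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


def retained_components(correlation, threshold=1.0):
    """Return (eigenvalue, loadings) pairs with eigenvalue > threshold."""
    work = [row[:] for row in correlation]
    retained = []
    for _ in range(len(correlation)):
        eigenvalue, vector = dominant_eigenpair(work)
        if eigenvalue <= threshold:
            break
        retained.append((eigenvalue, vector))
        # Deflation: remove the component that was just extracted.
        for i in range(len(work)):
            for j in range(len(work)):
                work[i][j] -= eigenvalue * vector[i] * vector[j]
    return retained


# Hypothetical correlation matrix of three intercorrelated indices.
correlation = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
components = retained_components(correlation)
print(len(components), round(components[0][0], 3))
```

With three strongly intercorrelated indices, a single component carries most of the variance and the remaining eigenvalues fall below one, so only one PC is retained.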


DISCRIMINATION  OF  ISOMERS  USING  TOPOLOGICAL 
INDICES  AND  PRINCIPAL  COMPONENTS 
DERIVED  FROM  THEM 


TABLE II TIs selected by variable clustering of the alkanes (octane isomers listed)

Isomer name                 H^V    SIC_0  SIC_1  SIC_4  ^3\chi_C  ^5\chi_C  P_h (i)  P_h (ii)
Octane                      1.288  0.173  0.218  0.477  0.000     0.000     2        0
2-methylheptane             1.233  0.173  0.248  0.561  0.342     0.000     2        0
3-methylheptane             1.228  0.173  0.248  0.598  0.254     0.000     2        0
4-methylheptane             1.215  0.173  0.248  0.503  0.254     0.000              0
3-ethylhexane               1.177  0.173  0.248  0.532  0.186     0.000     2        0
2,2-dimethylhexane          1.157  0.173  0.248  0.495  0.940     0.000     2        0
2,3-dimethylhexane          1.170  0.173  0.253  0.557  0.450     0.212              0
2,4-dimethylhexane          1.171  0.173  0.253  0.557  0.529     0.000     2        0
2,5-dimethylhexane          1.183  0.173  0.253  0.384  0.597     0.000              0
3,3-dimethylhexane          1.137  0.173  0.248  0.548  0.792     0.000     2        0
3,4-dimethylhexane          1.157  0.173  0.253  0.469  0.386     0.154     2        0
3-ethyl-2-methylpentane     1.096  0.173  0.253  0.490  0.405     0.154     2        0
3-ethyl-3-methylpentane     1.073  0.173  0.248  0.421  0.656     0.000     1        0
2,2,3-trimethylpentane      1.075  0.173  0.255  0.490  0.944     0.477     1        0
2,2,4-trimethylpentane      1.083  0.173  0.255  0.450  1.088     0.000              0
2,3,3-trimethylpentane      1.065  0.173  0.255  0.506  0.850     0.529     1        0
2,3,4-trimethylpentane      1.097  0.173  0.225  0.413  0.620     0.326     2        0
2,2,3,3-tetramethylbutane   0.997  0.173  0.218  0.218  1.253     1.179     0        0


TABLE III Values of the first seven PCs for the eighteen octane isomers

Isomer name                 PC_1    PC_2    PC_3    PC_4    PC_5    PC_6    PC_7
Octane                       0.328  -1.744   5.807   0.602  -0.320  -0.473  -0.433
2-methylheptane              2.181  -4.236   1.097   0.386   1.100   0.300  -0.935
3-methylheptane              2.817  -4.857  -0.307   0.921   0.368   0.366  -0.513
4-methylheptane              1.338  -2.211   0.848  -0.821   0.005  -0.541  -0.904
3-ethylhexane                1.553  -2.077  -0.348  -0.494  -0.817  -0.651  -0.290
2,2-dimethylhexane           1.163   0.007  -0.436  -0.878   1.367   1.383   0.638
2,3-dimethylhexane           2.122  -2.060  -1.546   0.502  -0.308  -0.253  -0.105
2,4-dimethylhexane           2.089  -2.306  -1.372  -0.289  -0.205   0.004   0.291
2,5-dimethylhexane          -0.769   1.340   1.473  -2.659   0.612  -0.387  -1.443
3,3-dimethylhexane           2.044  -0.573  -1.726   0.303   0.173   0.582   1.163
3,4-dimethylhexane           0.807   0.225  -0.825  -0.696  -0.730  -1.223  -0.545
3-ethyl-2-methylpentane      0.991  -0.035  -1.596  -0.672  -1.076  -1.438   0.110
3-ethyl-3-methylpentane     -0.035   2.870  -0.614  -0.909  -0.497  -1.178   0.271
2,2,3-trimethylpentane       1.136   2.191  -2.383   1.277   0.465  -0.075   0.548
2,2,4-trimethylpentane       0.377          -1.284  -1.846   0.726   0.461   1.676
2,3,3-trimethylpentane       1.318   1.825  -2.717   1.990   0.318  -0.400   0.251
2,3,4-trimethylpentane      -0.548   4.168   1.329   0.020  -1.745  -1.140  -0.039
2,2,3,3-tetramethylbutane   -4.473  12.522   2.681   4.256   1.345  -0.129  -2.627


Topological aspects of chemicals have been used in chemical documentation.
One line of research in this area has been the development of
topological indices which are more discriminatory. For example, the J index
developed by Balaban is one of the most discriminatory indices. Randic
developed the concept of molecular identification number (I.D. number) by
combining a few topological aspects of structures. Other authors have used
more than one index for this purpose. One example is the topological
superindex proposed by Bonchev et al. [55], where they use a collection of
indices as the superindex. Two structures are said to be distinct if the
magnitudes of any one of the component indices differ for them.
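The superindex criterion (two structures are distinct when at least one component index differs) is easy to state in code. The sketch below uses three octane isomers with their SIC_1, SIC_4 and fifth-column values from Table II; the function names are illustrative, and exact equality of the tabulated, rounded values is assumed to be the test for degeneracy.

```python
def discriminates(index_table):
    """True if every structure has a distinct tuple of index values."""
    vectors = [tuple(values) for values in index_table.values()]
    return len(set(vectors)) == len(vectors)


def degenerate_pairs(index_table, position):
    """Pairs of structures that share the value of one component index."""
    names = list(index_table)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if index_table[a][position] == index_table[b][position]
    ]


# Columns: SIC_1, SIC_4 and the fifth column of Table II.
isomers = {
    "2,3-dimethylhexane": (0.253, 0.557, 0.450),
    "2,4-dimethylhexane": (0.253, 0.557, 0.529),
    "2,5-dimethylhexane": (0.253, 0.384, 0.597),
}
print(discriminates(isomers))
print(degenerate_pairs(isomers, 0))
print(degenerate_pairs(isomers, 1))
```

SIC_1 alone cannot separate any of the three isomers and SIC_4 fails for one pair, yet the collection of three indices distinguishes all of them.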

In view of the intercorrelation of indices and the fact that a large number
of TIs have been defined in the literature, we have been interested in deriving
orthogonal parameters from TIs. We have employed two statistical
methods: variable clustering and principal components analysis (PCA). In
the former method, we begin with the TIs calculated by POLLY and derive
a small set of original variables which are minimally intercorrelated. In the
case of the seventy-four alkanes the method retained eight indices. In the
PCA, seven principal components (PCs) are derived from the original variables
and these PCs are linear combinations of all the TIs. For the STARLIST
set, twelve TIs were retained by variable clustering, while ten PCs were
derived.

We are interested in the discriminatory power of the TIs selected by
variable clustering vis-a-vis the PCs. Values of the TIs selected by the
variable clustering technique and the first seven PCs with eigenvalues greater
than 1.0 for the set of eighteen octane isomers are presented in Tables II and
III respectively. It is clear from the data that some individual TIs are not
sufficiently discriminatory for the eighteen octane isomers. On the other
hand, each PC is unique for any given structure, making them more
discriminatory than any individual TI. In the interest of space, the values of
the TIs and PCs for all of the alkanes and for the STARLIST set were not
included in the tables; however, this information is available upon request
from the authors.


TOPOLOGICAL  INDEX  SPACE  VIS-A-VIS 
PC  SPACE:  WHAT  DO  THEY  MEAN? 

Each  TI  quantifies  certain  aspects  of  molecular  structure.  Distinct  indices 
selected  by  the  variable  clustering  procedure  encode  different  information 
regarding  molecular  structure  (model  object).  For  example,  indices  like  the 
connectivity  index  or  Wiener  index  quantify  adjacency  information  of  the 
simple  planar  graph  model  of  molecules.  On  the  other  hand,  information 
theoretic  graph  invariants  quantify  the  degree  of  complexity  of  the 
molecular  graph.  Intuitively,  these  are  distinct  aspects  of  molecular  structure 
and  this  notion  is  borne  out  by  the  result  of  variable  clustering  analysis  on 
the set of TIs calculated by POLLY. It is tempting to speculate that each
index  retained  by  variable  clustering  represents  one  distinct  aspect  of 
molecular  architecture  and  that,  collectively,  the  TIs  form  the  structure 
space  of  the  set  of  chemicals.  Such  a  space  can  be  used  for  the  discrimination 
of  structures  and  structure-property  correlation.  The  magnitudes  of  eight 
TIs  for  the  eighteen  octane  isomers  show  that  the  TIs  selected  by  variable 
clustering  have  reasonable  power  for  discriminating  isomeric  structures. 

At the level of PCs, we have derived a certain number of orthogonal
variables using PCA of the indices. For the alkanes we had seven PCs with
eigenvalues greater than 1.0 (Tab. III) whereas for the structurally diverse
set of 219 compounds we had ten PCs with eigenvalues greater than 1.0.
This result indicates that the structure space for the set of 219 molecules is
more complex than that for the set of seventy-four alkanes. This is in
agreement with our intuitive notion that molecules with heteroatoms and
many functional groups are more complex than molecules devoid of any
heteroatom. Finally, the pattern of correlation of the individual PCs with
the TIs can help us in understanding the nature of the axes derived by PCA
(Tabs. IV and V).


DISCUSSION

The  major  objectives  of  this  paper  were: 

a)  To  illuminate  the  fundamental  nature  of  mathematical  invariants  of 
molecular  structure, 

b)  To  study  the  utility  of  graph  invariants  in  the  characterization  of 
molecular  structure,  and 

c)  To  study  the  intercorrelation  of  indices  and  extraction  of  orthogonal 
variables  from  TIs. 

It  is  clear  from  the  results  presented  in  this  paper  that  the  various  classes 
of  mathematical  invariants  quantify  different  aspects  of  molecular 
architecture.  They  depend  principally  on  the  structural  model  (model 
object)  used  for  the  calculation  of  the  invariant  as  well  as  the  intuitive  aspect 
of  molecular  structure  they  are  used  to  quantify.  For  example,  connectivity 
indices  and  neighbor  complexity  indices  were  designed  to  quantify  distinct 
aspects  of  molecular  structure.  The  results  of  variable  clustering  of  the 
congeneric  set  of  alkanes  and  the  diverse  set  of  219  chemicals  show  that 
these  indices  encode  largely  independent  structural  information  about  these 
molecules. 

Many structural schemes have been developed for the derivation of
numbers or sets of numbers which can discriminate closely related structures
so that they can be useful in chemical documentation. The results presented
in this paper show that both the collection of indices selected by variable
clustering and the PCs can discriminate among the eighteen octane
isomers (Tabs. II-V). It is also clear from the data that the PCs are more
discriminatory than the individual indices. For example, each PC has
distinct values for all eighteen octane isomers. PCs derived from TIs have
also been used in the discrimination of isospectral molecular graphs where
individual indices show a high degree of degeneracy [56].

Variable  clustering  of  TIs  for  the  set  of  seventy-four  alkanes  retained 
eight  parameters  which  can  be  classified  into  three  subsets: 

a) H^V and two path counts P_h, which represent generalized size and shape;

b) SIC_0, SIC_1, and SIC_4, which quantify molecular complexity; and

c) ^3\chi_C and ^5\chi_C, which encode information about molecular branching.

In the case of the more diverse set of 219 chemicals, the indices retained after
variable clustering fall into four subclasses:

a)  /)V  Pk  and  (general  shape  and  si/e): 

b) IC_4, SIC_3 and CIC_1 (complexity);



c)  ''A'ch  and  (cyclicity);  and 

d)  •'A'c,  (branching). 

A perusal of results from both sets indicates that distinct indices quantify
different intuitive aspects of molecular structure.

A similar picture emerges from the principal component analysis of both
sets of molecules. The first PC is strongly correlated with variables which
quantify shape and size. The next important factor is molecular complexity,
which is encoded by the second PC (Tabs. IV and V). The higher order PCs
(3-5) are strongly correlated with invariants which quantify such subtle
structural factors as branching, cyclicity, etc. It may be mentioned that such
a result emerged from our earlier studies on a large, diverse set of 3,692
chemicals [3, 57].

In conclusion, mathematical invariants derived from chemical topology
quantify different aspects of molecular architecture which are intuitively
understood by the chemist. One can create a structure space from these
invariants taking uncorrelated structural information (indices or PCs). Such
orthogonal factors can be useful in the discrimination of closely related
structures like isomers and in the creation of structure spaces. Metrics
defined on such spaces have been useful in the quantification of molecular
similarity [3-5, 58-63]. Orthogonal variables derived by PCA or variable
clustering can also be used in QSAR studies pertaining to pharmacology
and toxicology [1, 2, 6, 33-36, 40, 48-50, 64-68].

parameters; aquatic toxicity; benzene derivatives


INTRODUCTION 

Today's toxicologist is faced with a myriad of unknowns. In 1996
approximately 1.26 million new chemicals were registered with the Chemical
Abstracts Service (CAS), bringing the total number of registered chemicals to
around 15.8 million [1]. With such a large number of chemicals being
registered yearly, it is impossible to test all of them exhaustively for their
effects on the environment and human health. Chemicals can only be
evaluated as they are called into question, and for many of these compounds
there will be little or no test data available. Therefore, when the issue of hazard
assessment comes up, it becomes difficult at best to provide any useful
suggestions or analyses for many of the registered chemicals, including some
which are in commerce today. To complete the battery of tests necessary for
the proper hazard assessment of a single compound is an extremely costly
procedure and there is simply not enough time or money to complete these test
batteries for all compounds which are registered today [2]. As a result, when we
need to evaluate the human health or ecological hazards posed by a chemical it
becomes ever more important that we have accurate methods for estimating
the physicochemical and biological properties of molecules.

Quantitative structure-activity relationships (QSARs) have come into
widespread use for the prediction of various molecular properties and
biological responses. Traditional QSARs use empirical properties, e.g., boiling
point, melting point, octanol-water partition coefficient, or empirically
derived parameters, e.g., linear free energy related (LFER) and linear
solvation energy related (LSER) parameters, for the prediction of other
endpoints [3-8]. However, due to the scarcity of available data for the
majority of chemicals that need to be evaluated for ecotoxicological risk
assessment, these physicochemical properties necessary for traditional
QSAR model development may not be known. When this is the case, it is
imperative that we have methods that make use of nonempirical parameters.
One of the fundamental principles of biochemistry is that activity is dictated
by structure [9]. Following this principle, one can use theoretical molecular
descriptors which quantify structural aspects of the molecular structure
[10-27]. These theoretical descriptors can be generated directly from the
molecular structure alone, without any input of experimental data.

Topological indices (TIs) are numerical graph invariants that quantify
certain aspects of molecular structure. TIs are sensitive to such structural
features as size, shape, bond order, branching, and neighborhood patterns of
atoms in molecules. They can be derived from simple linear graphs,
multigraphs, weighted graphs, and weighted pseudographs. TIs derived from
these different classes of graphs will encode different types of information
about molecular architecture. The different classes of TIs provide us with
nonempirical, quantitative descriptors that can be used in place of
experimentally derived descriptors in QSARs for the prediction of properties.

Our recent studies have focused on the role of different classes of
theoretical descriptors of increasing levels of complexity and their utility in
QSAR [28-31]. This takes the form of a hierarchical approach which
examines the relative contributions of parameters of gradually increasing
complexity, e.g., structural, chemical, shape and quantum chemical
descriptors, in estimating physicochemical and biological properties.

In this paper we have reported the utility of this hierarchical approach in
modeling the acute aquatic toxicity (LC_50) of a congeneric set of sixty-nine
benzene derivatives.


THEORETICAL  METHODS 
Database 

Acute aquatic toxicity [-log(LC_50)] in fathead minnow (Pimephales
promelas) data was taken from the work of Hall, Kier and Phipps [32]. Their
data was compiled from eight other sources, as well as some original work
which was conducted at the U.S. Environmental Protection Agency
(USEPA) Environmental Research Laboratory in Duluth, Minnesota. The
complete set of fathead minnow data included 69 benzene derivatives.
According to the authors, the set of benzene derivatives were tested using
methodologies which were comparable to their 96-hour fathead minnow
toxicity test system. The derivatives chosen for this study have seven
different substituent groups that are all present in at least six of the molecules.
These groups consist of chloro, bromo, nitro, methyl, methoxyl, hydroxyl,
and amino substituents (Tab. I).


Computation  of  Indices 

Four distinct sets of theoretical descriptors have been used in this study.
These sets include topostructural, topochemical, geometric, and quantum
chemical indices. The topostructural and topochemical indices fall into the
category normally grouped together as topological indices. The geometrical
indices are the three-dimensional Wiener numbers for the hydrogen-filled
molecular structure and the hydrogen-suppressed molecular structure, and
the van der Waals volume.

Topostructural indices (TSIs) are topological indices which only encode
information about the adjacency and distances of atoms (vertices) in
molecular structures (graphs), irrespective of the chemical nature of the
atoms involved in bonding or factors such as hybridization states and
number of core/valence electrons in individual atoms. Topochemical indices
(TCIs) are parameters that quantify information regarding the topology

TABLE I Sixty-nine benzene derivatives and their fathead minnow toxicities, expressed as
-log(LC_50)

No.  Compound                       -log(LC_50)  -log(LC_50)     Residual
                                    (obs.)       (est., Eq. 4)
1    Benzene                        3.40         3.42            -0.02
2    Bromobenzene                   3.89         3.77             0.12
3    Chlorobenzene                  3.77         3.75             0.02
4    Phenol                         3.51         3.38             0.13
5    Toluene                        3.32         3.66            -0.34
6    1,2-dichlorobenzene            4.40         4.29             0.11
7    1,3-dichlorobenzene            4.30         4.37            -0.07
8    1,4-dichlorobenzene            4.62         4.51             0.11
9    2-chlorophenol                 4.02         3.79             0.23
10   3-chlorotoluene                3.84         3.88            -0.04
11   4-chlorotoluene                4.33         3.87             0.46
12   1,3-dihydroxybenzene           3.04         3.43            -0.39
13   3-hydroxyanisole               3.21         3.33            -0.12
14   2-methylphenol                 3.77         3.64             0.13
15   3-methylphenol                 3.29         3.60            -0.31
16   4-methylphenol                 3.58         3.53             0.05
17   4-nitrophenol                  3.36         3.61            -0.25
18   1,4-dimethoxybenzene           3.07         3.28            -0.21
19   1,2-dimethylbenzene            3.48         3.93            -0.45
20   1,4-dimethylbenzene            4.21         3.87             0.34
21   2-nitrotoluene                 3.57         3.66            -0.09
22   3-nitrotoluene                 3.63         3.53             0.10
23   4-nitrotoluene                 3.76         3.49             0.27
24   1,2-dinitrobenzene             5.45         5.24             0.21
25   1,3-dinitrobenzene             4.38         4.18             0.20
26   1,4-dinitrobenzene             5.22         4.94             0.28
27   2-methyl-3-nitroaniline        3.48         3.79            -0.31
28   2-methyl-4-nitroaniline        3.24         3.51            -0.27
29   2-methyl-5-nitroaniline        3.35         3.68            -0.33
30   2-methyl-6-nitroaniline        3.80         3.84            -0.04
31   3-methyl-6-nitroaniline        3.80         3.78             0.02
32   4-methyl-2-nitroaniline        3.79         3.80            -0.01
33   4-hydroxy-3-nitroaniline       3.65         3.61             0.04
34   4-methyl-3-nitroaniline        3.77         3.73             0.04
35   1,2,3-trichlorobenzene         4.89         4.89            -0.00
36   1,2,4-trichlorobenzene         5.00         5.04            -0.04
37   1,3,5-trichlorobenzene         4.74         5.11            -0.37
38   2,4-dichlorophenol             4.30         4.33            -0.03
39   3,4-dichlorotoluene            4.74         4.26             0.48
40   2,4-dichlorotoluene            4.54         4.36             0.18
41   4-chloro-3-methylphenol        4.27         3.87             0.40
42   2,4-dimethylphenol             3.86         3.76             0.10
43   2,6-dimethylphenol             3.75         3.80            -0.05
44   3,4-dimethylphenol             3.90         3.80             0.10
45   2,4-dinitrophenol              4.04         4.14            -0.10
46   1,2,4-trimethylbenzene         4.21         4.09             0.12
47   2,3-dinitrotoluene             5.01         5.20            -0.19
48   2,4-dinitrotoluene             3.75         4.10            -0.35
49   2,5-dinitrotoluene             5.15         4.84             0.31
50   2,6-dinitrotoluene             3.99         4.41            -0.42
51   3,4-dinitrotoluene             5.08         5.11            -0.03
52   3,5-dinitrotoluene             3.91         4.05            -0.14
53   1,3,5-trinitrobenzene          5.29         5.37            -0.08
54   2-methyl-3,5-dinitroaniline    4.12         4.13            -0.01
55   2-methyl-3,6-dinitroaniline    5.34         4.80             0.54
56   3-methyl-2,4-dinitroaniline    4.26         4.28            -0.02
57   5-methyl-2,4-dinitroaniline    4.92         4.14             0.78
58   4-methyl-2,6-dinitroaniline    4.21         4.67            -0.46
59   5-methyl-2,6-dinitroaniline    4.18         4.80            -0.62
60   4-methyl-3,5-dinitroaniline    4.46         4.34             0.12
61   2,4,6-tribromophenol           4.70         4.89            -0.19
62   1,2,3,4-tetrachlorobenzene     5.43         5.62            -0.19

6.^ 

1.  2.  4.  5-ieirachlorobenzenc 

5.85 

5.80 

0.05 

64 

2,4.  6-irichlorophenol 

4.33 

4.79 

-0.46 

65 

2-methyl-4.  6-dinitrophenol 

5.00 

4.21 

0.79 

66 

2.  3,  6-irinilrololuene 

6.37 

6.36 

O.OI 

67 

2,  4.  6-tnnitrotoIuene 

4.88 

5.16 

-0.28 

6S 

2.  3.  4,  5-icirachlorophenoi 

5.72 

5.36 

0.36 

69 

2.  3.  4.  5,  6-pcniachlorophcnol 

6.06 

6.03 

0.03 

(connectivity  of  atoms),  as  well  as  specific  chemical  properties  of  the  atoms 
comprising  a  molecule.  These  indices  are  derived  from  weighted  molecular 
graphs  where  each  vertex  (atom)  or  edge  (bond)  is  properly  weighted  with 
selected  chemical  or  physical  property  information.  Brief  definitions  of  the 
topological indices are shown in Table II.


Topological Indices

The 102 topological indices used in this study, both the topostructural and
the topochemical, have been calculated by POLLY 2.3 [33] and software
developed by the authors. These indices include the Wiener index [34], the
connectivity indices developed by Randic [35] and the higher order connectivity
indices formulated by Kier and Hall [36], bonding connectivity indices
defined by Basak et al. [37], a set of information theoretic indices defined on
the distance matrices of simple molecular graphs [38, 39] and neighborhood
complexity indices of hydrogen-filled molecular graphs [40, 41], and
Balaban's J indices [42-44]. Table III provides the list of the topostructural,
topochemical, geometrical and quantum chemical indices included in this
study.
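
As a concrete illustration of a topostructural index, the Wiener index (the half-sum of the off-diagonal elements of the distance matrix) can be sketched as follows. This is a minimal sketch and not the POLLY implementation; the function name `wiener_index` and the adjacency-list input format are assumptions made here for illustration. Topological distances are obtained by breadth-first search on the hydrogen-suppressed graph.

```python
from collections import deque

def wiener_index(adjacency):
    """Half-sum of the off-diagonal entries of the topological distance matrix.

    `adjacency` maps each vertex (atom) to the list of vertices bonded to it.
    Atom types and bond orders are ignored, as for any topostructural index.
    (Hypothetical helper, not part of POLLY.)
    """
    total = 0
    for source in adjacency:
        # Breadth-first search yields shortest path lengths counted in bonds.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    # Each vertex pair was counted twice, once from each end.
    return total // 2

# Hydrogen-suppressed benzene: a six-membered ring.
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

# Hydrogen-suppressed toluene: the ring plus a methyl carbon (vertex 6) on atom 0.
toluene = {i: list(nbrs) for i, nbrs in benzene.items()}
toluene[0].append(6)
toluene[6] = [0]

print(wiener_index(benzene), wiener_index(toluene))
```

Because only adjacency enters the calculation, chlorobenzene, phenol and toluene all share the same Wiener index under this scheme, which is the sense in which topostructural indices are blind to the chemical nature of the atoms.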


Geometrical  Indices 

Van der Waals volume, V_W [45-47], was calculated using Sybyl 6.1 from
Tripos Associates, Inc. [48]. The 3-D Wiener numbers were calculated by
Sybyl using an SPL (Sybyl Programming Language) program developed in
TABLE II  Symbols and definitions of topological and geometrical parameters

Symbol       Definition
I_D^W        Information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W        Mean information index for the magnitude of distance
W            Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D          Degree complexity
H^V          Graph vertex complexity
H^D          Graph distance complexity
ĪC           Information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB        Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O            Order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M_1          A Zagreb group parameter = sum of square of degree over all vertices
M_2          A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r         Mean information content or complexity of a graph based on the rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
SIC_r        Structural information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
CIC_r        Complementary information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
^hχ          Path connectivity index of order h = 0-6
^hχ_C        Cluster connectivity index of order h = 3, 5
^hχ_Ch       Chain connectivity index of order h = 6
^hχ_PC       Path-cluster connectivity index of order h = 4-6
^hχ^b        Bond path connectivity index of order h = 0-6
^hχ^b_C      Bond cluster connectivity index of order h = 3, 5
^hχ^b_Ch     Bond chain connectivity index of order h = 6
^hχ^b_PC     Bond path-cluster connectivity index of order h = 4-6
^hχ^v        Valence path connectivity index of order h = 0-6
^hχ^v_C      Valence cluster connectivity index of order h = 3, 5
^hχ^v_PC     Valence path-cluster connectivity index of order h = 4-6
P_h          Number of paths of length h = 1-9
J            Balaban's J index based on distance
J^B          Balaban's J index based on bond types
J^X          Balaban's J index based on relative electronegativities
J^Y          Balaban's J index based on relative covalent radii
V_W          van der Waals volume
3DW          3-D Wiener number for the hydrogen-suppressed geometric distance matrix
3DW_H        3-D Wiener number for the hydrogen-filled geometric distance matrix

TABLE III  Classification of parameters used in developing models for acute aquatic toxicity
(LC50) in Pimephales promelas

Topological            Topochemical              Geometric    Quantum chemical (AM1)
I_D^W                  I_ORB                     V_W          E_HOMO
Ī_D^W                  IC_0-IC_5                 3DW          E_HOMO1
W                      SIC_0-SIC_5               3DW_H        E_LUMO
I^D                    CIC_0-CIC_5                            E_LUMO1
H^V                    ^0χ^b-^6χ^b                            ΔH_f
H^D                    ^3χ^b_C and ^5χ^b_C                    μ
ĪC                     ^6χ^b_Ch
O                      ^4χ^b_PC-^6χ^b_PC
M_1                    ^0χ^v-^6χ^v
M_2                    ^3χ^v_C and ^5χ^v_C
^0χ-^6χ                ^4χ^v_PC-^6χ^v_PC
^3χ_C and ^5χ_C        J^B
^6χ_Ch                 J^X
^4χ_PC-^6χ_PC          J^Y
P_1-P_9
J


our lab [49]. Calculation of 3-D Wiener numbers consists of summing the entries
in the upper triangular submatrix of the topographic Euclidean distance
matrix for a molecule. The 3-D coordinates for the atoms were determined
using CONCORD 3.0.1 [50]. Two variants of the 3-D Wiener number were
calculated: 3DW_H and 3DW. For 3DW_H, hydrogen atoms are included in
the computations, and for 3DW, hydrogen atoms are excluded from the
computations.
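
The summation described above can be sketched in a few lines. This is an illustrative sketch only, not the SPL program used in the study; the function name `wiener_3d`, the `(element, (x, y, z))` input format and the toy coordinates are assumptions made here, and real coordinates would come from a 3-D structure generator such as CONCORD.

```python
from itertools import combinations
from math import dist

def wiener_3d(atoms, include_hydrogens=True):
    """Sum of Euclidean distances over all unordered atom pairs.

    This equals the sum of the upper triangular entries of the geometric
    distance matrix. `atoms` is a list of (element, (x, y, z)) tuples.
    With include_hydrogens=False the hydrogen-suppressed variant is returned.
    """
    coords = [xyz for element, xyz in atoms
              if include_hydrogens or element != "H"]
    return sum(dist(a, b) for a, b in combinations(coords, 2))

# Toy coordinates (hypothetical, not a real molecular geometry).
toy_atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("H", (0.0, 1.0, 0.0)),
]

w_suppressed = wiener_3d(toy_atoms, include_hydrogens=False)  # 3DW analogue
w_filled = wiener_3d(toy_atoms, include_hydrogens=True)       # 3DW_H analogue
print(w_suppressed, w_filled)
```

Unlike the graph-theoretical Wiener index, this quantity changes with conformation, since it depends on through-space rather than through-bond distances.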


Quantum  Chemical  Parameters 

The following quantum chemical parameters were calculated using the
Austin Model version one (AM1) semi-empirical Hamiltonian: energy of the
highest occupied molecular orbital (E_HOMO), energy of the second highest
occupied molecular orbital (E_HOMO1), energy of the lowest unoccupied
molecular orbital (E_LUMO), energy of the second lowest unoccupied
molecular orbital (E_LUMO1), heat of formation (ΔH_f), and dipole moment
(μ). These parameters were calculated using MOPAC 6.00 in the SYBYL
interface [51].
Data Reduction

Initially, all topological indices were transformed by the natural logarithm
of the index plus one. This was done to scale the indices, since some may be
several orders of magnitude greater than others, while other indices may
equal zero. The geometric indices were transformed by the natural logarithm
of the index for consistency; the addition of one was unnecessary.
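
The two transformations can be written out explicitly. The sketch below is illustrative only; the function names are assumptions made here, and the original scaling was presumably carried out within the statistical package rather than in Python.

```python
import math

def transform_topological(value):
    """ln(x + 1): defined at x = 0, so indices equal to zero remain usable."""
    return math.log1p(value)

def transform_geometric(value):
    """ln(x): geometric parameters are strictly positive, so no offset is needed."""
    return math.log(value)

# Indices differing by orders of magnitude are brought onto a comparable scale.
scaled = [transform_topological(x) for x in (0.0, 27.0, 2700.0)]
print(scaled)
```

The offset of one matters only for indices that can vanish, such as path counts of a length that does not occur in a small molecule.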

The set of eighty-one topological indices was then partitioned into two
distinct sets, the topostructural indices (thirty-three) and the topochemical
indices (forty-seven). To further reduce the number of independent variables
for model construction, the sets of topostructural and topochemical indices
were further divided into subsets, or clusters, based on the correlation
matrix using the SAS procedure VARCLUS [52]. This procedure divides
the set of indices into disjoint clusters, such that each cluster is essentially
unidimensional.

From each cluster we selected the index most correlated with the cluster,
as well as any indices which were poorly correlated with their cluster
(R^2 < 0.70). These indices were then used in the modeling of the acute
aquatic toxicity of benzene derivatives in fathead minnow. The variable
clustering and selection of indices was performed independently for both the
topostructural and topochemical indices. This procedure resulted in a set of
five topostructural indices and a set of nine topochemical indices.
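
The selection rule applied to the clusters can be sketched as follows. This is a minimal sketch under stated assumptions: the clustering itself (SAS VARCLUS) is not reproduced, the function name `select_from_clusters` is hypothetical, and the R^2 values in the example are invented purely to show the rule at work.

```python
def select_from_clusters(clusters, threshold=0.70):
    """Pick descriptors from variable clusters.

    `clusters` maps a cluster label to {index name: R^2 with its own cluster}.
    From each cluster keep (a) the index most correlated with the cluster and
    (b) any index whose R^2 with the cluster falls below `threshold`.
    """
    selected = []
    for members in clusters.values():
        best = max(members, key=members.get)
        selected.append(best)
        selected.extend(name for name, r2 in members.items()
                        if r2 < threshold and name != best)
    return selected

# Hypothetical R^2 values, for illustration only (not taken from the study).
example_clusters = {
    "cluster_1": {"W": 0.95, "M1": 0.90, "O": 0.40},
    "cluster_2": {"P8": 0.88, "P9": 0.91},
}
chosen = select_from_clusters(example_clusters)
print(chosen)
```

Keeping the poorly correlated members guards against discarding information that the cluster representative does not carry.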

Reducing the number of independent variables is critical when attempting
to model small datasets. The smaller the dataset is, the greater the chance of
spurious correlations when using a large number of independent variables
(descriptors). Topliss and Edwards have studied this issue of chance correlations
[53]. For a set with about seventy observations (values of the dependent
variable), to keep the probability of chance correlations less than 0.01, we can use at
most forty independent variables. This number is dependent on the actual
correlation achieved in the modeling process; with a high correlation we
have a better chance of using more variables with the same limited
probability of chance correlations. In this study we are well below the cut-off
of forty. In fact, the total number of descriptors which will be used for
model construction and estimation is twenty-three (five topostructural, nine
topochemical, three geometric, and six quantum chemical), well within the bounds
of the Topliss and Edwards criteria [53].

Statistical  Analysis  and  Hierarchical  QSAR 

Regression modeling was accomplished using the SAS procedure REG on
seven distinct sets of indices. These sets were constructed as part of a
hierarchical approach to QSAR model development. The hierarchy begins
with the simplest parameters, the TSIs. After using the TSIs to model the
activity, the next level of complexity is added. To the indices included in the
best TSI model, we add all of the TCIs and proceed to model the activity
using these parameters. Likewise, the indices included in the best model
from this procedure are combined with the indices from the next level, the
geometrical indices, and modeling is conducted once again. Finally, the best
model utilizing TSIs, TCIs and geometrical indices is combined with the
quantum chemical parameters. The regression analysis results in the final
selection of indices for each of the models. The remaining three models,
which use TCIs, geometric, and quantum chemical parameters independently,
serve as a means of validating the utility of the hierarchical approach
and the need for varying types of theoretical descriptors.
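
The tier-by-tier bookkeeping of the hierarchical approach can be sketched as below. This is an illustrative sketch, not the SAS workflow used in the study: the function `hierarchical_models` is hypothetical, and the model selection step (all-possible subsets regression in the text) is represented by a caller-supplied `select_best` function. The toy scores are invented so the example runs without any regression code.

```python
def hierarchical_models(tiers, select_best):
    """Build one model per tier, carrying forward only the retained descriptors.

    `tiers` is an ordered list of (tier name, [descriptor names]).
    `select_best(pool)` stands in for the regression-based selection and must
    return the descriptors retained from `pool`.
    """
    kept = []
    models = []
    for name, descriptors in tiers:
        # Pool = descriptors retained so far + every descriptor of the new tier.
        pool = kept + [d for d in descriptors if d not in kept]
        kept = select_best(pool)
        models.append((name, list(kept)))
    return models

# Toy stand-in for model selection: keep the two highest-scoring descriptors.
# The scores are hypothetical and carry no chemical meaning.
toy_score = {"a1": 0.2, "a2": 0.5, "b1": 0.9, "b2": 0.1, "c1": 0.6}

def keep_top_two(pool):
    return sorted(pool, key=toy_score.get, reverse=True)[:2]

toy_tiers = [("A", ["a1", "a2"]), ("B", ["b1", "b2"]), ("C", ["c1"])]
toy_models = hierarchical_models(toy_tiers, keep_top_two)
print(toy_models)
```

In the study the tiers would be the TSIs, TCIs, geometrical indices and quantum chemical parameters, in that order. The design keeps the candidate pool small at every step, which is what limits the risk of chance correlations.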


RESULTS 

The variable clustering of the topostructural indices resulted in the retention
of five indices: M_1, ĪC, O, P_8, and P_9. All-possible subsets regression resulted in
the selection of a four-parameter model to estimate -log(LC50) with an
explained variance (R^2) of 45.3% and a standard error (s) of 0.58. While this
is an unsatisfactory model, the indices will still be retained and combined
with the topochemical indices in the second step of model development.
Table IV lists the indices used in each of the models.

The second step of the hierarchical method combined the four indices
used in the first tier model with the nine topochemical indices selected in the
variable clustering procedure: SIC_0, SIC_1, SIC_4, CIC_0, [remaining symbols
illegible]. Again, all-possible subsets regression was conducted, resulting in a four-
parameter model with an explained variance (R^2) of 78.3% and a standard
error (s) of 0.36. While this model retained two parameters from the
topostructural model, it is evident that the addition of two topochemical
indices made a significant contribution to the effectiveness of our model.

The four indices from the second tier model were then combined with the
three geometric parameters: V_W, 3DW, and 3DW_H. The resulting model from this
procedure retained four indices, replacing the topochemical index CIC_0 with
the geometric parameter 3DW_H. This model had an explained variance (R^2)
of 79.2% and a standard error (s) of 0.36.

The final step in the hierarchical method combined the four parameters
from the third tier model with the quantum chemical (AM1) parameters:
E_HOMO, E_HOMO1, E_LUMO, E_LUMO1, ΔH_f, and μ. This set of ten indices led to a
seven-parameter model with an explained variance (R^2) of 86.3% and a
standard error (s) of 0.30. This model retained all of the indices from the
third model and added three quantum chemical parameters.

Three other models were constructed for the purpose of comparison. These
include a five-parameter topochemical model, a three-parameter geometric
model, and a four-parameter quantum chemical model. The indices used in
these models and the results of the models can be found in Table IV.
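
The F values reported with the models can be cross-checked against the explained variance. The sketch below assumes the standard overall F statistic for an ordinary least-squares model with an intercept, F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), with n = 69 compounds and p fitted parameters; the function name is an assumption made here, and small differences from the tabulated values are expected because R^2 is rounded to three digits.

```python
def f_statistic(r2, n_obs, n_params):
    """Overall F statistic of a linear model with intercept (standard OLS form)."""
    return (r2 / n_params) / ((1.0 - r2) / (n_obs - n_params - 1))

N_COMPOUNDS = 69

# (R^2, number of parameters, reported F) for three of the models.
reported = {
    "TSI (Eq. 1)": (0.453, 4, 13.3),
    "Full hierarchical (Eq. 4)": (0.863, 7, 55.0),
    "Quantum chemical (Eq. 7)": (0.598, 4, 23.8),
}

for label, (r2, p, f_reported) in reported.items():
    f_calc = f_statistic(r2, N_COMPOUNDS, p)
    print(f"{label}: calculated F = {f_calc:.1f}, reported F = {f_reported}")
```

Note that F falls slightly from the three-tier to the four-tier model even though R^2 rises, because the three added quantum chemical parameters consume degrees of freedom.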


DISCUSSION 

The goal of this paper was to investigate the utility of hierarchical QSAR
using algorithmically derived molecular descriptors in predicting LC50
values for a set of sixty-nine benzene derivatives. To this end, we used four
classes of parameters, viz., topostructural descriptors, topochemical indices,
geometrical descriptors and semiempirical quantum chemical indices.

It is clear from the results described in Table IV that none of the
individual classes of parameters correlate well with acute aquatic toxicity.
The TSIs, the simplest of the four classes of parameters, explained about
45% of the variance in toxicity. The inclusion of topochemical indices in the
set of independent variables made a substantial improvement in the predictive
capacity of the QSAR models. This is understandable since the benzene
derivatives analyzed in this paper comprise a fairly congeneric set, and while
the number and size of substituents may be important, the chemical nature
of the substituents also plays an important role in determining the overall
toxicity of the molecule. This is shown by the dramatic increase in predictive
power between Eqs. 1 and 2. Equation 2 replaces two TSI descriptors with
two TCI indices that are sensitive to the atom types in all zero-order
neighborhoods. The addition of this basic chemical information results in an


TABLE IV  Summary of the regression results for all models for the full set of sixty-nine
benzene derivatives

Eq.  Parameter class                              Variables included                                    F      R^2     s
1    TSI                                          M_1, ĪC, [illegible], P_9                             13.3   0.453   0.58
2    TSI + TCI                                    M_1, P_9, SIC_0, CIC_0                                57.9   0.783   0.36
3    TSI + TCI + Geometric                        M_1, P_9, SIC_0, 3DW_H                                61.1   0.792   0.36
4    TSI + TCI + Geometric + Quantum Chemical     M_1, P_9, SIC_0, 3DW_H, E_LUMO1, ΔH_f, μ              55.0   0.863   0.30
5    TCI                                          SIC_0, SIC_1, CIC_0, [two symbols illegible]          34.3   0.731   0.41
6    Geometric                                    V_W, 3DW, 3DW_H                                       34.8   0.616   0.48
7    Quantum Chemical                             E_HOMO1, E_LUMO, E_LUMO1, μ                           23.8   0.598   0.50

TABLE V  Calculated values for the topostructural, topochemical, geometric and quantum
chemical parameters used in Eq. 4 (Tab. IV)

No.  M_1  P_9  SIC_0   3DW_H   E_LUMO1       ΔH_f        μ
 1   3    0    0.246   5.21     0.5540       22.0240     0.005
 2   3    0    0.315   5.25     0.2447       26.7581     1.449
 3   3    0    0.315   5.25     0.2632       14.8214     1.299
 4   3    0    0.304   5.43     0.5095      -22.2334     1.233
 5   3    0    0.227   5.79     0.5745       16.5004     0.279
 6   4    0    0.341   5.28    -0.0203        9.2203     1.974
 7   4    0    0.341   5.28    -0.0462        8.2544     1.218
 8   4    0    0.341   5.28    -0.0988       10.4661     0.000
 9   4    0    0.362   5.46     0.2406      -28.6621     0.934
10   4    0    0.284   5.81     0.2785        7.1915     1.478
11   4    0    0.284   5.82     0.3208        7.1066     1.623
12   4    0    0.323   5.64     0.3778      -66.4516     2.433
13   4    0    0.295   6.16     0.4618      -59.9961     2.338
14   4    0    0.276   5.95     0.5331      -28.9297     0.960
15   4    0    0.276   5.97     0.5610      -29.6368     1.079
16   4    0    0.276   5.97     0.4880      -29.7869     1.333
17   4    0    0.376   5.84    -0.4095      -19.5199     5.261
18   4    0    0.274   6.59     0.5766      -52.9350     2.424
19   4    0    0.213   6.22     0.6180        7.5221     0.465
20   4    0    0.213   6.28     0.6450        6.8236     0.003
21   4    0    0.341   6.11    -0.2692       19.0823     5.015
22   4    0    0.341   6.14    -0.2921       17.6145     5.443
23   4    0    0.341   6.15    -0.2334       17.2948     5.728
24   4    2    0.389   5.99    -1.2793       38.6210     7.804
25   4    0    0.389   6.01    -1.5339       33.1466     4.845
26   4    0    0.389   6.02    -1.0875       33.2941     0.013
27   4    0    0.344   6.38    -0.1596       20.4489     5.727
28   4    0    0.344   6.41    -0.0919       14.3213     7.434
29   4    0    0.344   6.41    -0.1084       19.7541     6.185
30   4    0    0.344   6.39    -0.0006       13.8471     5.374
31   4    0    0.344   6.42     0.1022       12.9086     5.649
32   4    0    0.344   6.42     0.0314       13.3128     5.280
33   4    0    0.376   6.15    -0.2384      -15.9560     6.801
34   4    0    0.344   6.4[?]  -0.1379       18.0141     5.596
35   4    0    0.349   5.31    -0.3391        4.2313     2.070
36   4    0    0.349   5.31    -0.2761        2.9490     1.033
37   4    0    0.349   5.31    -0.3927        2.2158     0.020
38   4    0    0.385   5.49    -0.1034      -35.1296     0.395
39   4    0    0.312   5.84     0.0251        1.5862     2.296
40   4    0    0.312   5.84     0.0006        1.2199     1.464
41   4    0    0.326   5.99     0.2063      -36.1532     1.059
42   4    0    0.255   6.40     0.5006      -36.4200     1.052
43   4    0    0.255   6.38     0.5503      -35.5810     1.199
44   4    0    0.255   6.38     0.5387      -36.6403     1.229
45   4    0    0.383   6.17    -1.5210       -8.7887     6.201
46   4    0    0.202   6.64     0.6477       -0.1093     0.274
47   4    2    0.365   6.40    -1.2262       31.8226     7.909
48   4    0    0.365   6.43    -1.4332       26.3804     5.390
49   4    0    0.365   6.42    -1.0421       26.9397     0.797
50   4    0    0.365   6.39    -1.4076       30.3487     3.639
51   4    2    0.365   6.43    -1.1564       32.0703     8.256
52   4    0    0.365   6.44    -1.4923       25.3294     5.321
53   4    0    0.378   6.33    -2.5221       44.8961     0.032
54   4    0    0.362   6.66    -1.2453       27.9172     6.590
55   4    0    0.362   6.65    -0.6994       25.1359     3.166
56   4    0    0.362   6.65    -1.1532       23.8377     5.797
57   4    0    0.362   6.67    -1.3084       51.2351     7.196
58   4    0    0.362   6.68    -1.0204       18.0757     2.366
59   4    0    0.362   6.66    -1.0160       54.7718     3.199
60   4    0    0.362   6.66    -1.2172       29.5227     5.090
61   4    0    0.392   5.54    -0.4993        2.2014     1.096
62   4    0    0.341   5.34    -0.5585       -0.5979     1.616
63   4    0    0.341   5.34    -0.6587        3.2072     0.000
64   4    0    0.392   5.52    [illegible]  -38.2930     1.083
65   4    0    0.362   6.56    -1.5102      -19.8380     4.669
66   4    2    0.365   6.66    -1.9189       46.0695     3.518
67   4    0    0.365   6.67    -2.3240       41.4239     1.418
68   4    0    0.385   5.54    -0.5526      -43.2613     1.231


improvement in the model. A similar conclusion is borne out from the
QSAR analysis of the same set of benzene derivatives reported by Hall et al.,
where they found that the chemical nature of the substituent is important in
determining toxicity [32].

In the next tier, Eq. 3 replaces one of the information content indices with
the three-dimensional Wiener number, a descriptor that characterizes the
three-dimensional aspects of molecular shape and size. This leads to
refinement of the model developed in Eq. 2. Finally, the addition of the
quantum chemical parameters (energy of the second lowest unoccupied
molecular orbital, heat of formation, and dipole moment) leads to a marked
improvement in the predictive power of the model (Eq. 4).

As can be seen from Eqs. 1 and 5-7 (Tab. IV), none of the four classes of
indices does very well individually. The hierarchical QSAR approach using
four classes of parameters resulted in acceptable predictive models (Eq. 4).
We may conclude from the results presented in this paper that each of the
four classes of theoretical descriptors that were used is necessary for the
development of good QSARs for the acute aquatic toxicity of benzene
derivatives in fathead minnow.
Numerous quantitative structure-activity relationships (QSARs) have been developed using topostructural,
topochemical, and geometrical molecular descriptors. However, few systematic studies have been carried
out on the relative effectiveness of these three classes of parameters in predicting properties. We have
carried out a systematic analysis of the relative utility of the three types of structural descriptors in developing
QSAR models for predicting vapor pressure at STP for a set of 476 diverse chemicals. The hierarchical
technique has proven to be useful in illuminating the relationships of different types of molecular description
information to physicochemical property and is a useful tool for limiting the number of independent variables
in linear regression modeling to avoid the problems of chance correlations.


1.  INTRODUCTION 


A large number of quantitative structure-activity relationship
(QSAR) studies have been reported in recent literature
using theoretical molecular descriptors in predicting physicochemical,
pharmacological, and toxicological properties
of molecules. Such descriptors comprise graph invariants,
geometrical or 3-D parameters, and quantum chemical
indices. One of the reasons for the current upsurge of interest
is the fact that such descriptors can be derived algorithmically,
i.e., can be computed for any molecule, real or
hypothetical, using standard software. Both in pharmaceutical
drug design and in risk assessment of chemicals, one
has to evaluate potential biological effects of chemicals.
Evaluation schemes based on property-property correlation
paradigms are not very useful in practical situations, because,
for most of the candidate structures, the experimental data
necessary for proper evaluation are not available. This is
especially true for the thousands of chemicals rapidly
produced by methods of combinatorial chemistry as well
as for the large number of chemicals present in the Toxic
Substances Control Act (TSCA) Inventory.


A large number of physicochemical and biological endpoints
are necessary for estimating the ecotoxicological fate,
transport, and effects of environmental pollutants. The
vapor pressure of chemicals is important in determining the
partitioning of chemicals among different phases once they
are released in the environment. Many QSARs have been
reported for predicting normal vapor pressure of chemicals.
Such studies are usually carried out on small sets of
congeneric chemicals. Also, many QSARs use experimental
data as inputs in the model. Therefore, it becomes necessary
to develop QSARs based on nonempirical parameters which
can predict the vapor pressure for a heterogeneous collection
of chemicals so that such models are generally applicable.
With this end in mind, in the current paper we have carried
out a QSAR study of 476 diverse chemicals using three types
of nonempirical molecular descriptors.
2.  MATERIALS AND METHODS

2.1.  Normal Vapor Pressure Database.  Measured
values for a subset of the Toxic Substances Control Act
(TSCA) Inventory were obtained from the ASTER (Assessment
Tools for the Evaluation of Risk) database. This
subset consisted of a diverse set of chemicals for which vapor
pressure (p_vap) was measured at 25 °C, over a pressure
range of approximately 3-10 000 mmHg. Due to the size
of the dataset being used in this study, data for these
chemicals will not be listed in this paper. An electronic copy
of the data may be obtained by contacting the authors.

2.2.  Computation of Topological Indices.  The majority
of the topological indices (TIs) used in this study have been
calculated by the computer program POLLY 2.3. These
indices include the Wiener index, the molecular connectivity
indices developed by Randic and Kier and Hall, information
theoretic indices defined on distance matrices of
graphs, and a set of parameters derived on the neighborhood
complexity of vertices in hydrogen-filled molecular
graphs. Balaban's J indices were calculated using
software developed by the authors.

Van der Waals volume (V_W) was calculated using
Sybyl 6.2. The 3-D Wiener numbers were calculated by
Sybyl using an SPL (Sybyl Programming Language) program
developed by the authors. Calculation of 3-D Wiener
numbers consists of the summation of the entries in the upper
triangular submatrix of the topographic Euclidean distance
matrix for a molecule. The 3-D coordinates for the atoms
were determined using CONCORD 3.2.1. Two variants
of the 3-D Wiener number were calculated, 3DW_H and 3DW,
where hydrogen atoms are included and excluded from the
computations, respectively.

Table 1 provides a complete listing of all of the topological
and geometrical parameters which have been used in this
study. The listing includes the symbols used to represent
the parameters and brief definitions for each of the parameters.

Two additional parameters were used in modeling normal
vapor pressure, HB_1 and dipole moment (μ). HB_1 is a
simple hydrogen bonding parameter calculated using a
program developed by Basak, which is based on the ideas

© 1997 American Chemical Society

652  J. Chem. Inf. Comput. Sci., Vol. 37, No. 4, 1997                Basak et al.



Table 1.  Symbols and Definitions of Topological and Geometrical
Parameters

Symbol       Definition
I_D^W        information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W        mean information index for the magnitude of distance
W            Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D          degree complexity
H^V          graph vertex complexity
H^D          graph distance complexity
ĪC           information content of the distance matrix partitioned by frequency of occurrences of distance h
I_ORB        information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O            order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
M_1          a Zagreb group parameter = sum of square of degree over all vertices
M_2          a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r         mean information content or complexity of a graph based on the rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
SIC_r        structural information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
CIC_r        complementary information content for rth (r = 0-5) order neighborhood of vertices in a hydrogen-filled graph
^hχ          path connectivity index of order h = 0-6
^hχ_C        cluster connectivity index of order h = 3-6
^hχ_PC       path-cluster connectivity index of order h = 4-6
^hχ_Ch       chain connectivity index of order h = 5, 6
^hχ^b        bond path connectivity index of order h = 0-6
^hχ^b_C      bond cluster connectivity index of order h = 3-6
^hχ^b_Ch     bond chain connectivity index of order h = 5, 6
^hχ^b_PC     bond path-cluster connectivity index of order h = 4-6
^hχ^v        valence path connectivity index of order h = 0-6
^hχ^v_C      valence cluster connectivity index of order h = 3-6
^hχ^v_Ch     valence chain connectivity index of order h = 5, 6
^hχ^v_PC     valence path-cluster connectivity index of order h = 4-6
P_h          number of paths of length h = 0-10
J            Balaban's J index based on distance
J^B          Balaban's J index based on bond types
J^X          Balaban's J index based on relative electronegativities
J^Y          Balaban's J index based on relative covalent radii
V_W          van der Waals volume
3DW          3-D Wiener number for the hydrogen-suppressed geometric distance matrix
3DW_H        3-D Wiener number for the hydrogen-filled geometric distance matrix

of Ou et al. Dipole moment was calculated using Sybyl
6.2.

2.3. Data Reduction. The set of 92 TIs was partitioned
into two distinct subsets: topostructural indices and
topochemical indices. The distinction was made as follows:
topostructural indices encode information about the adjacency
and distances of atoms (vertices) in molecular structures
(graphs) irrespective of the chemical nature of the atoms
involved in the bonding or factors like hybridization states
of atoms and number of core/valence electrons in individual
atoms, while topochemical indices quantify information
regarding the topology (connectivity of atoms) as well as
specific chemical properties of the atoms comprising a
molecule. Topochemical indices are derived from weighted
molecular graphs where each vertex (atom) is properly
weighted with selected chemical/physical properties. These
subsets are shown in Table 2.
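
The distinction can be illustrated with a minimal sketch that computes two simple topostructural indices, the Wiener index W and the first-order path connectivity index, from a hydrogen-suppressed graph. The adjacency-dictionary representation and the function names are assumptions made for illustration; they are not the software used in this study, and the sketch assumes a connected graph.

```python
from collections import deque
from math import sqrt

def wiener_index(adj):
    """Sum of shortest-path distances over all unordered vertex pairs."""
    total = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

def connectivity_index_order1(adj):
    """Sum over edges (i, j) of 1 / sqrt(degree_i * degree_j)."""
    return sum(1.0 / sqrt(len(adj[i]) * len(adj[j]))
               for i in adj for j in adj[i] if i < j)

# Hydrogen-suppressed skeletons (vertex labels are arbitrary).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))
print(connectivity_index_order1(n_butane), connectivity_index_order1(isobutane))
```

Because only adjacency and distance enter the calculation, replacing a carbon atom by a heteroatom leaves both values unchanged. Capturing that difference requires vertex weights, which is what the topochemical indices add.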

The partitioning of the indices left 38 topostructural indices
and 54 topochemical indices. At this point no further data
reduction is called for, since the ratio of the number of
observations in the training set (342) to the total number of
variables (92 maximum) falls well within the condition limits
suggested by Topliss and Edwards for reducing the
probability of spurious correlations even at the more
conservative R² > 0.7 level.

Table 2. Classification of Parameters Used in Modeling Normal
Vapor Pressure [log10(pvap)]

class              parameters
topological        I_D^W, Ī_D^W, W, I^D, H^V, H^D, IC, O, M1, M2,
                   ⁰χ-⁶χ, ³χ_C-⁶χ_C, ⁴χ_PC-⁶χ_PC, ⁵χ_Ch and ⁶χ_Ch,
                   P0-P10, J
topochemical       I_ORB, IC0-IC5, SIC0-SIC5, CIC0-CIC5,
                   ⁰χ^b-⁶χ^b, ³χ^b_C-⁶χ^b_C, ⁴χ^b_PC-⁶χ^b_PC, ⁵χ^b_Ch and ⁶χ^b_Ch,
                   ⁰χ^v-⁶χ^v, ³χ^v_C-⁶χ^v_C, ⁴χ^v_PC-⁶χ^v_PC, ⁵χ^v_Ch and ⁶χ^v_Ch,
                   J^B, J^X, J^Y
geometric          V_W, 3DW, 3DW_H
other parameters   HB1, μ

2.4.  Statistical  Analysis  and  Hierarchical  QSAR. 
Initially,  all  TIs  were  transformed  by  the  natural  logarithm 
of  the  index  plus  one.  This  was  done  since  the  scale  of  some 
indices  may  be  several  orders  of  magnitude  greater  than  that 
of  other  indices.  The  geometric  parameters  were  transformed 
by  the  natural  logarithm  of  the  parameter. 
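
The two transformations described above can be sketched as follows. The function names and the sample values are hypothetical. The text gives only the scaling rationale for the transformation; the added constant also keeps a zero-valued index finite.

```python
import math

def transform_topological(value):
    """ln(index + 1), the transformation described for the TIs."""
    return math.log(value + 1.0)

def transform_geometric(value):
    """ln(parameter), the transformation described for the geometric parameters."""
    return math.log(value)

# Hypothetical raw index values spanning several orders of magnitude.
raw_indices = [0.0, 3.5, 1200.0]
scaled = [transform_topological(v) for v in raw_indices]
print(scaled)
```

After the transformation the three hypothetical values fall in the range 0 to about 7, so no single index dominates the regression purely because of its scale.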

Two regression procedures were used in developing the
linear models. When the number of independent variables
was high, typically greater than 25, a stepwise regression
procedure was used to maximize the improvement of the
explained variance (R²). When the number of independent
variables was smaller, all possible subsets regression was
used. Models were then optimized to reduce problems of
variance inflation and collinearity. Regression modeling was
conducted using the REG procedure of the statistical package
SAS.
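
As a rough illustration of stepwise selection driven by R², the sketch below implements a forward-only greedy variant in plain Python. It is not the SAS procedure used here, which may also remove or swap variables. The stopping threshold `min_gain`, the helper names, and the toy data are all assumptions.

```python
def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on `columns` plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(k + 1, p):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, p + 1):
                aug[r][c] -= factor * aug[k][c]
    # Back substitution for the regression coefficients.
    beta = [0.0] * p
    for k in reversed(range(p)):
        tail = sum(aug[k][c] * beta[c] for c in range(k + 1, p))
        beta[k] = (aug[k][p] - tail) / aug[k][k]
    mean_y = sum(y) / n
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

def forward_stepwise(candidates, y, min_gain=0.01):
    """Add the descriptor that raises R^2 most; stop when the gain is below min_gain."""
    selected, best_r2 = [], 0.0
    remaining = dict(candidates)
    while remaining:
        scores = {name: r_squared([candidates[s] for s in selected] + [col], y)
                  for name, col in remaining.items()}
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        selected.append(name)
        best_r2 = scores[name]
        del remaining[name]
    return selected, best_r2

# Toy data: y depends on descriptors "a" and "b" only; "c" is unrelated.
candidates = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [1, 0, 0, 1, 1, 0, 0, 1],
    "c": [3, 1, 4, 1, 5, 9, 2, 6],
}
y = [2 * a - 3 * b for a, b in zip(candidates["a"], candidates["b"])]
selected, r2 = forward_stepwise(candidates, y)
print(selected, round(r2, 4))
```

The normal-equations solver is adequate for a small illustration but fails on exactly collinear descriptors, which is one reason the fitted models were subsequently checked for variance inflation and collinearity.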

The vapor pressure data (pvap) was split into a training set
(342 compounds) and a test set (134 compounds), an
approximately 75/25 split. Models were developed using
the training set of chemicals and then used to predict the
pvap values of the test chemicals. Final models were then
developed using the combined training and test set of
chemicals.
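
A minimal sketch of such a partition is given below. It assumes a random assignment with a fixed seed, which the text does not specify. The integer compound identifiers and the function name are hypothetical.

```python
import random

def split_train_test(items, n_train, seed=0):
    """Shuffle a copy of `items` and cut it into training and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

compound_ids = list(range(476))  # stand-ins for the 476 compounds
train, test = split_train_test(compound_ids, n_train=342)
train_fraction = len(train) / len(compound_ids)
print(len(train), len(test), round(train_fraction, 3))
```

With these counts the training fraction is 342/476, or about 0.72.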

Five sets of indices were used in model development.
These sets were constructed as part of a hierarchical approach
to QSAR modeling. The hierarchy begins with the simplest
indices, the topostructural. After developing our initial model
utilizing the topostructural indices, we increase the level of
complexity. To the indices included in the best topostructural
model, we add all of the topochemical indices and proceed
to model pvap using these parameters. Likewise, the indices
included in the best model from this procedure are combined
with the geometrical indices and modeling is conducted once
again. In addition to this hierarchical approach, models were
also constructed using the topochemical indices alone and
the geometrical indices alone for purposes of comparison.
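
The tiered procedure can be sketched as a short function that takes any variable-selection routine as an argument. The `select` callable, the descriptor names, and the toy selector are hypothetical stand-ins; in practice the selector would be the stepwise or all-subsets regression described above.

```python
def hierarchical_selection(select, topostructural, topochemical, geometrical):
    """Each tier keeps the survivors of the previous tier and adds one new class."""
    tier1 = select(topostructural)
    tier2 = select(tier1 + topochemical)
    tier3 = select(tier2 + geometrical)
    return {
        "topostructural": tier1,
        "topostructural + topochemical": tier2,
        "topostructural + topochemical + geometrical": tier3,
    }

# Toy selector for illustration only: keeps descriptors from a fixed set.
USEFUL = {"chi1", "P10", "IC1", "IC2"}

def toy_select(names):
    return [name for name in names if name in USEFUL]

tiers = hierarchical_selection(
    toy_select,
    topostructural=["W", "chi1", "P10"],
    topochemical=["IC1", "IC2", "SIC3"],
    geometrical=["Vw", "3DW"],
)
for label, kept in tiers.items():
    print(label, kept)
```

In this toy run the third tier returns the same descriptors as the second, which is the pattern one would see if the geometrical class added nothing to the model.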

3.  RESULTS 

Stepwise regression analyses for log10(pvap) of the training
set of chemicals are summarized in Table 3. As shown in
Table 3, the topostructural model using three parameters
resulted in an explained variance (R²) of 48.1% and a
standard error (s) of 0.56. Addition of the topochemical
parameters to the three topostructural parameters led to a
significant increase in the effectiveness of the model. The
resulting model used 12 parameters, two topostructural and
ten topochemical. This model had an R² of 80.4% and s of
0.35. All subsets regression of the two topostructural and
ten topochemical indices retained thus far and the three
geometrical indices resulted in the selection of the same 12
parameter model; thus the geometrical indices did not
contribute significantly to model development. Several other
models were constructed for comparative purposes. Using
topochemical indices only, a ten parameter model was
developed which had an R² of 79.2% and s of 0.36. A
geometrical model was developed which utilized all three
geometrical indices and resulted in an R² of 51.8% and s of
0.53. Finally, two additional stepwise models were developed.
One model simply used all indices for a comparison
between a simple stepwise analysis of the data and the results
of the hierarchical procedure. This resulted in an 11
parameter model with R² of 79.6% and s of 0.35. The second
model added two new parameters, HB1 and μ. We thought
that it might be possible to improve our modeling by adding
in some other nonempirical parameters which could be
important to the determination of normal vapor pressure. We
selected the parameters HB1 and μ since they would be
important in intermolecular interactions which could have a
dramatic effect on vapor pressure. To look at the addition
of these parameters, we conducted a stepwise regression
analysis using all topostructural, topochemical, and geometric
indices so that we would be able to optimize our model,
just as we had done with the previous models. The addition
of these parameters led to the selection of a ten parameter
model which included three topostructural indices, six
topochemical indices, and HB1. This was the best model
yet, with an R² of 82.9% and s of 0.32.

Table 3. Summary of the Regression Results for the Training Set and the Prediction Results for the Test Set for the Hierarchical Analysis of Vapor Pressure

                                                   training set (N = 342)     test set (N = 134)
parameter class                 no. of variables   F       R² (%)   s         R² (%)   s
topostructural                  3                  104.6   48.1     0.56      57.9     0.46
topochemical                    10                 126.3   79.2     0.36      85.8     0.27
geometrical                     3                  168.9   51.8     0.53      62.2     0.44
topostructural + topochemical   12                 112.5   80.4     0.35      84.7     0.28
all indices                     11                 117.4   79.6     0.35      84.2     0.28
all indices + HB1 + μ           10                 160.8   82.9     0.32      83.1     0.29

Application  of  these  six  models  to  the  test  set  of  chemicals 
resulted in comparable R² and s; actually all models improved
slightly  on  their  predictions  of  the  test  set,  and  these  values 
are  also  listed  in  Table  3.  Based  on  these  results,  we  decided 
that  it  was  pointless  to  develop  further  models  using  only 
geometrical  parameters.  Also,  based  on  the  findings  that 
the  geometrical  indices  did  not  contribute  significantly  to 
any  of  the  training  models,  they  were  dropped  from  the 
development  of  final  models  for  the  full  set  of  476  chemicals. 
However,  even  though  the  topostructural  indices  did  not 
perform  well  in  modeling  vapor  pressure  by  themselves,  they 
will  be  used  in  model  development  since  they  did  contribute 
significantly  to  most  of  the  models. 

Regression analyses of the combined set of 476 chemicals
showed similar results for estimating log10(pvap) as analysis
of the training set. Using only the topostructural indices,
stepwise regression analysis resulted in a five parameter
model to estimate vapor pressure:

$$\log_{10}(p_{vap}) = 4.88 + 0.20(O) - 2.56({}^{1}\chi) + 0.49({}^{?}\chi_{C}) + 0.79({}^{?}\chi_{C}) + 0.98(P_{10}) \tag{1}$$

n = 476, R² = 51.5%, s = 0.53, F = 99.7

Stepwise regression using the five topostructural parameters
and all topochemical parameters resulted in the selection
of the following seven parameter model:

$$\log_{10}(p_{vap}) = 8.44 - 1.77({}^{1}\chi) + 1.25(P_{10}) - 5.69(IC_{1}) + 3.91(IC_{2}) - 1.24(IC_{5}) + 1.41({}^{?}\chi^{?}_{C}) - 1.70({}^{?}\chi^{?}) \tag{2}$$

n = 476, R² = 79.3%, s = 0.34, F = 224.0

Only two of the topostructural indices used in eq 1 were
retained by the stepwise regression procedure used to produce
eq 2: ¹χ and P10. The improvement in R² was significant,
increasing from 51.5% for eq 1 to 79.3% for eq 2. Also,
the model error decreased significantly, dropping by 0.19
logarithmic units. Since we have dropped the geometrical
indices, this becomes our final hierarchical model.

The  stepwise  regression  analysis  of  only  topochemical 
parameters  resulted  in  a  12  parameter  model: 

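$$\log_{10}(p_{vap}) = \;?\; 3.44(IC_{0}) - 1.33(IC_{5}) + 3.47(SIC_{1}) + 0.87(CIC_{1}) - 0.48({}^{?}\chi^{?}) + 1.44({}^{?}\chi^{?}_{C}) - 1.00({}^{?}\chi^{?}) - 0.41({}^{?}\chi^{?}) - 0.70({}^{?}\chi^{?}) - 1.08({}^{?}\chi^{?}_{C}) + 1.42({}^{?}\chi^{?}_{Ch}) - 1.23(J^{?}) \tag{3}$$

n = 476, R² = 75.8%, s = 0.38, F = 120.5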

This model is inferior to the topostructural +
topochemical model (eq 2) because its variance explained
is lower and, more importantly, it requires more independent
variables (parameters) to achieve this explanation of variance.

Stepwise  regression  of  all  indices  resulted  in  the  selection 
of  an  11  parameter  model.  This  approach  selected  three 
topostructural  indices  and  eight  topochemical  indices  to  arrive 
at the following model:

$$\log_{10}(p_{vap}) = 7.85 - 2.56(H^{?}) + 1.17({}^{?}\chi_{C}) - 5.01(IC_{1}) + 3.65(IC_{2}) - 0.99(IC_{5}) + 0.51(CIC_{1}) - 1.54({}^{?}\chi^{?}) - 0.36({}^{?}\chi^{?}) - 0.36({}^{?}\chi^{?}) - 1.40({}^{?}\chi^{?}_{C}) \tag{4}$$

n = 476, R² = 80.4%, s = 0.33, F = 173.4
Figure 1. Scatterplot of observed log10(pvap) vs estimated log10(pvap)
using eq 5 for 476 diverse compounds.


While eq 4 shows some slight improvements over eq 2,
the hierarchical model, eq 2 is preferred since it is a simpler
model using seven indices instead of 11 and based on a
comparison of F values it is a more robust model than that
in eq 4.

Finally, we conducted the stepwise regression modeling
using all topostructural and topochemical indices with HB1
and μ for the complete set of 476 chemicals. The resulting
ten parameter model used three topostructural indices, six
topochemical indices, and HB1:

$$\log_{10}(p_{vap}) = \;?\; 3.66({}^{?}\chi) + 0.??(P_{?}) + 0.74(P_{?}) - 1.78(IC_{?}) - 3.33(SIC_{?}) - 0.81(CIC_{?}) + 2.05({}^{?}\chi^{?}) - 1.73({}^{?}\chi^{?}) - 0.79({}^{?}\chi^{?}) - 0.29(HB_{1}) \tag{5}$$

n = 476, R² = 84.3%, s = 0.29, F = 249.5

Equation 5 shows marked improvement over eq 2,
justifying the addition of indices to the model. Also, it meets
the criteria on which eq 4 was judged to be lacking. Overall,
there is an improvement in variance explained of 5%, with
a comparable decrease in standard deviation. A scatter plot
of observed log10(pvap) versus estimated log10(pvap) using eq
5 is presented in Figure 1.
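
The F values used to compare models can be related to R², the number of compounds n, and the number of parameters k. The sketch below assumes that the reported F is the usual overall-regression statistic for a model with an intercept; the text does not state this, and the function name is hypothetical.

```python
def f_statistic(r2, n, k):
    """Overall regression F for k predictors plus an intercept and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

f_eq4 = f_statistic(0.804, 476, 11)  # eq 4: 11 parameters, R^2 = 80.4%
f_eq5 = f_statistic(0.843, 476, 10)  # eq 5: 10 parameters, R^2 = 84.3%
print(round(f_eq4, 1), round(f_eq5, 1))
```

Under that assumption the computed values fall within about one unit of the reported F = 173.4 and F = 249.5. The small differences are consistent with R² having been rounded to three figures.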

4.  DISCUSSION 

The purpose of this paper was 2-fold: (a) to study the
utility of algorithmically-derived molecular descriptors in
developing QSAR models for predicting the vapor pressure
of chemicals from structure and (b) to investigate the relative
roles of topostructural, topochemical, and geometrical indices
in the estimation of standard vapor pressure.

Table 4. Summary of the Chemical Class Composition of the
Normal Vapor Pressure Dataset

compd classification                  no. of compds   pure   substituted
total normal vapor pressure dataset   476
hydrocarbons                          253
non-hydrocarbons^a                    223
nitro compounds                       4               3      1
amines                                20              17     3
nitriles                              7               6      1
ketones                               7               7      0
halogens                              100             95     5
anhydrides                            1               1      0
esters                                18              16     2
carboxylic acids                      2               2      0
alcohols                              10              6      4
sulfides                              39              38     1
thiols                                4               4      0
imines                                2               2      0
epoxides                              1               1      0
aromatic compounds^b                  15              11     4
fused-ring compounds^c                1               1      0

^a The non-hydrocarbons are further broken down into the following
groups. ^b The 15 aromatic compounds are a mixture of 11 aromatic
hydrocarbons and four aromatic halides. ^c The only fused-ring
compound was a polycyclic aromatic hydrocarbon.

Results described in this paper (eqs 1-5) show that
nonempirical parameters derived predominantly from graph
theoretic models of molecules can estimate normal vapor
pressure of diverse chemicals reasonably well. The explained
variance of data (R² = 84.3%) is excellent in view
of the fact that the database of chemicals analyzed in this
paper is very diverse (see Table 4). It should be mentioned
that most published QSAR models for the estimation of vapor
pressure have dealt with much smaller data sets with limited
structural variety.

The relative effectiveness of topostructural, topochemical,
and geometrical indices in predicting normal vapor pressure
of chemicals is evident from the results presented above.
Equation 1 explains over 51% of variance in the data. All
parameters used to derive eq 1 are topostructural, i.e., they
are parameters which encode information about the adjacency
and distance of vertices in skeletal molecular graphs without
quantifying any explicit information about such chemical
aspects as bond order, electronic character of atoms, etc.
Yet, the high explained variance of the property indicates
that adjacency and distance in chemical graphs, being general
descriptors of molecular size, shape, and branching, are
important in predicting properties. This may explain the
success of parameters like simple connectivity indices in
estimating many diverse properties.

Equation 3 is derived only from topochemical indices. The
explained variance of vapor pressure (75.8%) shows that
topochemical parameters, as a class, explain a larger fraction
of the variance as compared to models derived from only
topostructural indices (eq 1). Geometrical parameters were
dropped from the set of descriptors after their limited success
in prediction for the training and test sets. This is in line
with our earlier studies with normal boiling point and
hydrophobicity, where it was reported that the addition of
geometrical indices could not significantly improve the
predictive power of QSAR models derived from a combined
set of topostructural and topochemical parameters. It would
be interesting to see whether this pattern holds good for other
properties as well. Finally, the addition of the simple
nonempirical parameter, HB1, which contains information
relevant to intermolecular interactions, further improves our
ability to estimate normal vapor pressure, resulting in an
explained variance of 84.3% (eq 5).